We will use this dataset to gain insights that could help reduce flight delays, and to make evidence-backed findings about which carriers have the fewest delays.
The dataset consists of 7,009,724 rows (data points) after removing 4 duplicates. It reports flights in the United States during 2008, including carriers, arrival and departure delays, and the reasons for delays. I reduced the dataset from 29 columns (features) to 19, and eventually to 9, and I replaced the carrier codes with carrier names taken from another file, carriers.csv. After some structuring with SQL, I exported the columns I would work with to 2008_flights.csv, then returned to the Jupyter notebook to complete the work.
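The preparation steps described above can be sketched roughly as follows. This is a minimal sketch on a tiny made-up frame, not the notebook's actual code: the column names `UniqueCarrier`, `Code`, and `Description`, the new column `CarrierName`, and the helper name `prepare_flights` are all assumptions made for illustration.

```python
import pandas as pd

def prepare_flights(flights: pd.DataFrame, carriers: pd.DataFrame, keep_cols: list) -> pd.DataFrame:
    """Drop duplicate rows, attach carrier names, and keep only the chosen columns."""
    deduped = flights.drop_duplicates()
    # Left join so every flight is kept even if its carrier code has no match.
    merged = deduped.merge(carriers, how="left", left_on="UniqueCarrier", right_on="Code")
    merged = merged.rename(columns={"Description": "CarrierName"})
    return merged[keep_cols].reset_index(drop=True)

# Tiny made-up example (hypothetical values, not taken from the real data).
flights = pd.DataFrame({
    "UniqueCarrier": ["WN", "WN", "AA"],
    "ArrDelay": [20.0, 20.0, -3.0],
    "DepDelay": [15.0, 15.0, 0.0],
})
carriers = pd.DataFrame({
    "Code": ["WN", "AA"],
    "Description": ["Southwest Airlines Co.", "American Airlines Inc."],
})

clean = prepare_flights(flights, carriers, ["CarrierName", "ArrDelay", "DepDelay"])
print(clean)
```

Exporting the result would then be a single `clean.to_csv('2008_flights.csv', index=False)` call.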
❯ Investigation points
# times of flight cancellation causes for year 2008 (7,009,724 rows)
# A = carrier, B = weather, C = NAS, D = security
plt.figure(figsize=(20,8))
default_color = sns.color_palette()[0]
sns.set(style='white')
ax = sns.countplot(data = df_inv, x = 'CancellationCode',
                   order = df_inv['CancellationCode'].value_counts().sort_index().index,
                   color = default_color,palette='muted')
ax.set_box_aspect(4/len(ax.patches)) # change 4 (numerator) to modify the y/x axis ratio
sns.despine(left=True,bottom=True)
plt.title("Flights-cancellation causes frequencies \nfor year 2008",
fontsize=21,
color='#BF1B00',
weight='bold',y=1.04)
#plt.xlabel('Cause',fontsize=16.5,color='blue')
#plt.ylabel('Freq.\n ',fontsize=16.5,color='blue')
plt.xlabel(''),plt.ylabel('')
#plt.yticks([0,10000,20000,30000,40000,50000],[0,'1e4','2e4','3e4','4e4','5e4'],fontsize=12.5,weight='bold')
plt.yticks([])
# Recalculating the causes_counts
causes_counts = df_inv['CancellationCode'].value_counts().sort_index()
# get the current tick locations and labels
locs, labels = plt.xticks(fontsize=15,weight='bold')
# loop through each pair of locations and labels
text_colors = iter(sns.color_palette())
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct count
    count = causes_counts[label.get_text()]
    # print the annotation just above the top of the bar
    plt.text(loc, count+600, s=f"{count:,}", ha = 'center',fontsize=15,weight='bold',color=next(text_colors))
ax.set_xticklabels(causes)
# coloring xticks labels
colors = list(sns.color_palette())
for xtick, color in zip(ax.get_xticklabels(), colors):
    xtick.set_color(color);
# polished histogram
sns.set_theme(style='ticks')
plt.figure(figsize=(15,6))
sns.histplot(df_inv_15['ArrDelay'],bins=155,kde=True,
color=sns.color_palette()[9],
edgecolor=sns.color_palette()[0], linewidth=1) #,
#hist_kws={'histtype':'bar','alpha':0.1875}),color='skyblue'
sns.despine()
plt.title('Arrival delays (15 min. or more) distribution \nfor the year 2008 flights\n (limiting to 500 min.)',
fontsize=18,color='#BF1B00',weight='bold')
plt.xlabel('Arrival delay (min.)',color='blue',fontsize=16)
plt.ylabel('Freq.',color='blue',fontsize=16);
plt.xlim([0,500])
plt.xticks([0, 15, 30,46,62,78,93,109,125,141,157,200,215,230,250,300,315,330,350,400,500],
fontsize=13,weight='bold')
plt.yticks(np.arange(0,7,1)*10**5,
[0]+[str(n)+'e5' for n in range(1,7,1)],
#[0,'1e5','2e5','3e5','4e5','5e5','6e5'],
fontsize=13,weight='bold');
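The tick labels such as '1e5' are typed or assembled by hand in several of these cells. A small helper along the following lines could produce them instead. This is only a sketch, and `scaled_label` is a hypothetical name that does not appear in the notebook.

```python
def scaled_label(value, scale, suffix):
    """Return a short tick label such as '2e5' or '0.5k'; zero stays '0'."""
    if value == 0:
        return "0"
    # The 'g' format drops trailing zeros, so 200000 / 1e5 prints as '2'.
    return f"{value / scale:g}{suffix}"

# Labels in units of 1e5, as used on the histogram's frequency axis.
ticks = [0, 100000, 200000, 300000]
labels = [scaled_label(t, 1e5, "e5") for t in ticks]
print(labels)

# Labels in thousands of minutes, as used on the boxplot axes.
k_labels = [scaled_label(t, 1000, "k") for t in [0, 500, 1000, 1500]]
print(k_labels)
```

The resulting lists can be passed as the second argument of `plt.yticks`, in place of the hand-built label lists.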
# finding arrival-delays outliers
delays = [2500,2000,1500,1000,500,400,350,250]
print(f"\nFrom {len(df_inv_15):,} flights of 15 minutes arrival delay or more:\n*****")
for delay in delays[::-1]:
    print(f"'{df_inv_15[df_inv_15.ArrDelay > delay].shape[0]:,}' flights",
          'exceeded "{}" min. arrival delay'.format(delay))
    # print('Flights that exceeded "{}" minutes arrival delay are'.format(delay),
    #       f"'{df_inv_15[df_inv_15.ArrDelay > delay].shape[0]:,}' from",
    #       f"{len(df_inv_15):,} flights of 15 minutes arrival delay or more.")
From 1,524,733 flights of 15 minutes arrival delay or more:
*****
'20,726' flights exceeded "250" min. arrival delay
'5,824' flights exceeded "350" min. arrival delay
'3,439' flights exceeded "400" min. arrival delay
'1,562' flights exceeded "500" min. arrival delay
'189' flights exceeded "1000" min. arrival delay
'9' flights exceeded "1500" min. arrival delay
'2' flights exceeded "2000" min. arrival delay
'0' flights exceeded "2500" min. arrival delay
# univariate boxplot of arrival delays (15 min. or more) for all flights data
plt.figure(figsize=(15,8))
plt.subplot(121)
sns.set(font_scale=1.19,style='ticks')
# main plot
ax = sns.boxplot(data= df_inv_15,
y='ArrDelay',)#palette='pastel')
plt.title('Arrival delays (15 min. or more) boxplot \nfor the year 2008 flights',
fontsize=21,
color='#BF1B00',
weight='bold')
plt.ylabel('Arrival delay (min.)',fontsize=18,color='blue')
plt.xlabel('',fontsize=18,color='blue')
#ax.set_xticklabels(causes)
plt.yticks(np.arange(0,3000,500),
[0] + [str(n)+'k' for n in np.arange(0.5,3,0.5)],
#[0] + [str(n/10)+'k' for n in range(5,30,5)],
fontsize=14)
sns.despine(left=True)
ax.grid(True,axis ='y')
# second plot for focusing on IQR rectangles and limiting arrival delay axis to 150 minutes
plt.subplot(122)
sns.set(font_scale=1.19,style='white')
ax = sns.boxplot(data= df_inv_15,
y='ArrDelay',palette='pastel')
sns.despine(bottom=True)
plt.title('Highlighting IQR rectangle \non the left boxplot',fontsize=21,color='#AD1FFF')
plt.xlabel('')
plt.ylabel('')
ax.set_xticks([])
plt.yticks(fontsize=18)
ax.set_yticks([0,15,23,37,69,138,150])
ax.set_ylim([0,150])
ax.grid(True,axis ='y');
plt.subplots_adjust(wspace=.2);
# univariate violinplot of arrival delays (15 min. or more) for all flights data
plt.figure(figsize=(15,8))
plt.subplot(121)
sns.set(font_scale=1.19,style='ticks')
# main plot
ax = sns.violinplot(data= df_inv_15,
y='ArrDelay',)#palette='pastel')
plt.title('Arrival delays (15 min. or more) violinplot \nfor the year 2008 flights',
fontsize=21,
color='#BF1B00',
weight='bold')
plt.ylabel('Arrival delay (min.)',fontsize=18,color='blue')
plt.xlabel('',fontsize=18,color='blue')
#ax.set_xticklabels(causes)
plt.yticks(np.arange(0,3000,500),
[0] + [str(n)+'k' for n in np.arange(0.5,3,0.5)],
#[0] + [str(n/10)+'k' for n in range(5,30,5)],
fontsize=14)
sns.despine(left=True)
ax.grid(True,axis ='y')
# second plot for focusing on IQR rectangles and limiting arrival delay axis to 150 minutes
plt.subplot(122)
sns.set(font_scale=1.19,style='white')
ax = sns.violinplot(data= df_inv_15,
y='ArrDelay',palette='pastel')
sns.despine(bottom=True)
plt.title('Highlighting the distribution around the median \non the left violinplot',fontsize=21,color='#AD1FFF')
plt.xlabel('')
plt.ylabel('')
ax.set_xticks([])
plt.yticks(fontsize=18)
ax.set_yticks([0,15,23,37,69,138,150])
ax.set_ylim([0,150])
ax.grid(True,axis ='y');
plt.subplots_adjust(wspace=.2);
# univariate "Arrival Delays" violinplot for flights sample data (500 rows)
g = sns.FacetGrid(data = melt_samp_15,
#col = 'Cause',hue='Cause',
ylim=(-25,600),height=9)
title = 'Arrival delays (15 min. or more) distribution \nfor the year 2008 flights sampled data points'
g.fig.suptitle(title,
fontsize=21,
color='#BF1B00',
weight='bold')#,y=1.04)
sns.despine(left=True,bottom=True)
g.map(sns.violinplot,y=melt_samp_15['ArrDelay'])
# g.set_titles('{col_name}')
#g.set_titles('{col_name}')
#xticks = [-60,0,100,200,300,400]
#g.set(xticks=xticks)
#g.set_xticklabels(xticks,fontsize=13.5)
yticks= [-20]+list(np.arange(0,700,100))
#g.axes[0,0].set_yticks(yticks)
plt.yticks(yticks,yticks,
fontsize=16,weight='bold')
g.set_ylabels('Arrival delay (min.)',color='blue',fontsize=18,weight='bold')
#g.set_ylabels('');
#for i in range(4):
# g.axes[0,i].set_title(causes[i]+' delay (min.)',fontsize=18,color='blue')
# g.axes[0,i].grid(True);
# drawing quartiles lines using dictionary of lists
# creating lines dictionary
quartile_dict = {23 : ['25% (Q1 or First Quartile line) about 23' , sns.color_palette('dark')[2]], #'g'
37 : ['50% or Median line about 37', 'red'],
69 : ['75% (Q3 or Third Quartile line) about 69' , 'orange']}
#colors = (color for color in list(sns.color_palette()))
for yloc, l_list in quartile_dict.items():
    #color = color=next(colors)#
    plt.axhline(yloc,color=l_list[1])
    plt.text(.52,yloc,l_list[0],color=l_list[1],fontsize=14.2,weight='bold')
# # drawing quartiles lines using dictionary of dictionaries (3 ways )
# creating lines dictionary
# quartile_dict = {23 : {'25% (Q1 or First Quartile line) about 23' : 'g'},
# 37 : {'50% or Median line = 37': 'red'},
# 69 : {'75% (Q3 or Third Quartile line) about 69' : 'orange'}}
# for yloc, l_dict in quartile_dict.items(): # --------------- first way
#     key_str =list(l_dict.keys())[0]
#     color=l_dict[key_str]
#     plt.axhline(yloc,color=color)
#     plt.text(.52,yloc,key_str,color=color)
# for yloc in quartile_dict: # --------------- second way (More general and short)
#     for string,color in quartile_dict[yloc].items():
#         plt.axhline(yloc,color=color)
#         plt.text(.52,yloc,string,color=color)
# for yloc in quartile_dict: # --------------- third way
#     for line in quartile_dict[yloc]:
#         color = quartile_dict[yloc][line]
#         plt.axhline(yloc,color=color)
#         plt.text(.52,yloc,line,color=color)
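The quartile positions 23, 37, and 69 are typed in by hand in the dictionary above. A minimal sketch of how such values could instead be computed with `Series.quantile` follows. The series here is made up, and `quartile_lines` is a hypothetical helper name, so the numbers it prints are not those of the flights data.

```python
import pandas as pd

def quartile_lines(delays: pd.Series) -> dict:
    """Map each quartile value to a short label, keyed like quartile_dict."""
    q1, median, q3 = delays.quantile([0.25, 0.5, 0.75])
    # Note: if two quartiles coincide, the later label overwrites the earlier one.
    return {
        q1: "25% (Q1)",
        median: "50% (median)",
        q3: "75% (Q3)",
    }

# Made-up arrival delays in minutes, for illustration only.
sample = pd.Series([15, 20, 25, 40, 60, 90, 200])
lines = quartile_lines(sample)
for yloc, label in lines.items():
    print(f"{label}: {yloc:g}")
```

Each key could then be passed to `plt.axhline` in the same way as the hard-coded values.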
# plotting delay causes distributions in df_inv_15 without zero or missing values (filtering from 1,524,733 rows)
# histograms
plt.figure(figsize=(20,5.95))
sns.set(style='ticks')
title='Causes-delays distributions \nfor the year 2008 flights of 15+ min. arrival delays'
plt.suptitle(title+'\n (independent axes)',y=1.04,
fontsize=21,color='black',# '#BF1B00',
weight='bold')
colors = (color for color in list(sns.color_palette('dark')))
for i,delay_col in enumerate(delay_causes_cols):
    mask = df_inv_15[delay_col].isin(zero_or_null)
    cause_col = df_inv_15[~mask][delay_col]
    plt.subplot(1,4,i+1)
    color = next(colors)
    ax = sns.histplot(cause_col,bins=155,kde=False,color = color)
    sns.despine()
    plt.xlabel('\n'+causes[i]+' delay (min.)',color=color,fontsize=18,weight='bold')
    plt.ylabel('Freq.',color=color,fontsize=18,weight='bold')
    #plt.xlim((0,2500))
    #xtick = [0,300,700,1300,1800,2400,3000,3800]
    #plt.xticks(xtick,xtick,fontsize=13.5)
    plt.xticks(fontsize=14,weight='bold',color=color)
    #plt.ylim((0,300000))
    #ytick = [0,450,850]
    #plt.yticks(ytick,ytick,fontsize=13.5)
    plt.yticks(fontsize=14,weight='bold',color=color)
plt.subplots_adjust(wspace=.3)
# plotting delay causes distributions in df_inv_15 without zero or missing values (filtering from 1,524,733 rows)
# limiting max minutes to 3000 to avoid a number of outliers
# histograms
plt.figure(figsize=(20,10))
sns.set(style='ticks')
title='Causes-delays distributions \nfor the year 2008 flights of 15+ min. arrival delays'
plt.suptitle(title+'\n\n(independent axes)',y=1.04,
fontsize=21,color='black',# '#BF1B00',
weight='bold')
colors = (color for color in sns.color_palette('dark'))
for i,delay_col in enumerate(delay_causes_cols):
    mask = df_inv_15[delay_col].isin(zero_or_null)
    cause_col = df_inv_15[~mask][delay_col]
    plt.subplot(2,2,i+1)
    color = next(colors)
    ax = sns.histplot(cause_col,bins=155,kde=False,color = color)
    sns.despine()
    plt.xlabel('\n'+causes[i]+' delay (min.)',color=color,fontsize=18,weight='bold')
    plt.ylabel('Freq.',color=color,fontsize=18,weight='bold')
    #plt.xlim((0,2500))
    #xtick = [0,300,700,1300,1800,2400,3000,3800]
    #plt.xticks(xtick,xtick,fontsize=13.5)
    plt.xticks(fontsize=14,weight='bold',color=color)
    #plt.ylim((0,300000))
    #ytick = [0,450,850]
    #plt.yticks(ytick,ytick,fontsize=13.5)
    plt.yticks(fontsize=14,weight='bold',color=color)
    # adjusting yticklabels
    if i == 0:
        yticks = (ax.get_yticks()).astype('int')
        yticklabels = [0] + [str(n/100000)+'e5' for n in yticks[1:]]
        plt.yticks(ticks=yticks,
                   labels=yticklabels,
                   fontsize=14,weight='bold',color=color)
    elif i == 1:
        yticks = (ax.get_yticks()).astype('int')
        yticklabels = [0] + [str(n/1000)+'e3' for n in yticks[1:9]] +['']
        plt.yticks(ticks=yticks,
                   labels=yticklabels,
                   fontsize=14,weight='bold',color=color)
    elif i == 2:
        yticks = (ax.get_yticks()).astype('int')
        yticklabels = [0] + [str(n/100000)+'e5' for n in yticks[1:6]] +['']
        plt.yticks(ticks=yticks,
                   labels=yticklabels,
                   fontsize=14,weight='bold',color=color)
    plt.grid(True,axis='y')
plt.subplots_adjust(hspace=.4,wspace=.3)
# finding zero delays or less of delay causes
print(f"\nFrom {len(df_inv_15):,} flights of 15 minutes arrival delay or more:\n*****")
for i,col in enumerate(delay_causes_cols):
    outliers = df_inv_15[df_inv_15[col] <=0].shape[0]
    print(f"'{outliers:,}' flights",
          'had "{}" min. or less of {}-delay'.format(0,causes[i].replace('\n',' ').lower()))
print('-----------------------')
# finding causes-delay times outliers
delays = [2500,2000,1500,1000,500,400,350,250]
for i,col in enumerate(delay_causes_cols):
    total = df_inv_15[~df_inv_15[col].isin(zero_or_null)]
    print('\nFrom {:,} {}-delay times:\n*****'.format(len(total),causes[i].replace('\n',' ').lower()))
    for delay in delays[::-1]:
        outliers = df_inv_15[df_inv_15[col] > delay].shape[0]
        print(f"'{outliers:,}' flights",
              'exceeded "{}" min. {}-delay'.format(delay,causes[i].replace('\n',' ').lower()))
        if outliers == 0:
            break
From 1,524,733 flights of 15 minutes arrival delay or more:
*****
'854,113' flights had "0" min. or less of carrier-delay
'1,424,748' flights had "0" min. or less of weather-delay
'596,703' flights had "0" min. or less of national air system-delay
'1,518,531' flights had "0" min. or less of security-delay
-----------------------

From 670,620 carrier-delay times:
*****
'6,445' flights exceeded "250" min. carrier-delay
'2,642' flights exceeded "350" min. carrier-delay
'1,837' flights exceeded "400" min. carrier-delay
'1,026' flights exceeded "500" min. carrier-delay
'116' flights exceeded "1000" min. carrier-delay
'6' flights exceeded "1500" min. carrier-delay
'1' flights exceeded "2000" min. carrier-delay
'0' flights exceeded "2500" min. carrier-delay

From 99,985 weather-delay times:
*****
'1,535' flights exceeded "250" min. weather-delay
'500' flights exceeded "350" min. weather-delay
'315' flights exceeded "400" min. weather-delay
'156' flights exceeded "500" min. weather-delay
'12' flights exceeded "1000" min. weather-delay
'0' flights exceeded "1500" min. weather-delay

From 928,030 national air system-delay times:
*****
'3,299' flights exceeded "250" min. national air system-delay
'721' flights exceeded "350" min. national air system-delay
'359' flights exceeded "400" min. national air system-delay
'100' flights exceeded "500" min. national air system-delay
'9' flights exceeded "1000" min. national air system-delay
'0' flights exceeded "1500" min. national air system-delay

From 6,202 security-delay times:
*****
'5' flights exceeded "250" min. security-delay
'2' flights exceeded "350" min. security-delay
'0' flights exceeded "400" min. security-delay
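Several of the plots that follow read from a long-format frame, `melt_15`, with `Cause` and `Minutes` columns, whose construction is not shown in this section. One plausible way to build such a frame is sketched below, assuming it comes from melting the four delay-cause columns and dropping zero and missing values. The helper name `melt_causes` and the sample rows are hypothetical.

```python
import pandas as pd

delay_cols = ["CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay"]

def melt_causes(df: pd.DataFrame) -> pd.DataFrame:
    """Reshape wide delay-cause columns into long (Cause, Minutes) rows."""
    long_df = df.melt(id_vars=["ArrDelay"], value_vars=delay_cols,
                      var_name="Cause", value_name="Minutes")
    # Keep only rows where the cause actually contributed some delay.
    keep = long_df["Minutes"].notna() & (long_df["Minutes"] > 0)
    return long_df[keep].reset_index(drop=True)

# Made-up rows for illustration only.
wide = pd.DataFrame({
    "ArrDelay": [30.0, 45.0],
    "CarrierDelay": [30.0, 0.0],
    "WeatherDelay": [0.0, 40.0],
    "NASDelay": [0.0, 5.0],
    "SecurityDelay": [None, 0.0],
})
long_15 = melt_causes(wide)
print(long_15)
```

Keeping `ArrDelay` as an identifier column means each cause row still carries the flight's total arrival delay, which is what plotting `ArrDelay` against `Cause` requires.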
# delay-causes delays bivariate histogram for flights with 15+ minutes arrival delay
plt.figure(figsize=(15,5))
order = melt_15.Cause.unique()
ax = sns.histplot(data=melt_15,x='Minutes',bins=155,
hue='Cause',hue_order=order,
)#palette='bright')
sns.despine()
plt.xlim((0,250))
plt.title('Causes-delays (limited to 250 min.) frequencies\nfor year 2008 flights of 15+ min. arrival delay\n\n',
fontsize=21)
plt.xlabel('Delay (min.)',fontsize=18)
plt.ylabel('Freq.\n',fontsize=18)
plt.xticks(fontsize=14)
yticks = ax.get_yticks()
plt.yticks(yticks,
[0]+[str(n/100000) +'e5' for n in yticks[1:]],
fontsize=14);
# polished histogram
sns.set_theme(style='ticks')
plt.figure(figsize=(15,18))
title = 'Causes-delays distributions \nfor the year 2008 flights of 15 min. or more arrival delays'
plt.suptitle(title + '\n (Focusing on from 0 to 500 min. per cause)',
fontsize=21,color='#BF1B00',weight='bold',y=.94)
colors = (color for color in list(sns.color_palette('dark')))
for i,delay_col in enumerate(delay_causes_cols):
    mask = df_inv_15[delay_col].isin(zero_or_null)
    cause_col = df_inv_15[~mask][delay_col]
    plt.subplot(4,1,i+1)
    sns.histplot(cause_col,bins=70,kde=True,
                 color=sns.color_palette()[i])#[9])#,hist_kws={'histtype':'bar','alpha':0.1875}),color='skyblue'
    sns.despine()
    plt.xlabel(causes[i]+' delay (min.)',color=next(colors),fontsize=15)
    plt.ylabel('Freq.',color='blue',fontsize=15);
    plt.xlim([0,500])
    plt.xticks([0, 20, 37,62,78,93,109,125,141,157,200,215,230,250,300,315,330,350,400,500],
               fontsize=13,weight='bold')
    plt.ylim((0,700000))
    plt.yticks(np.arange(0,7,1)*10**5,
               [0]+[str(n)+'e5' for n in range(1,7,1)],
               #[0,'1e5','2e5','3e5','4e5','5e5','6e5'],
               fontsize=13,weight='bold')
plt.subplots_adjust(hspace=.3);
# plotting Delay causes distributions for sampled flights (500 rows)
# with limiting up to 80 minutes for delay cause in the analysed sample
# histplots
sns.set_theme(font_scale=1.21,style='ticks')
g = sns.FacetGrid(data = melt_samp_15 ,
col = 'Cause',hue='Cause',palette='dark',
xlim=(0,150),
#height=3.9,aspect=.90)
height=5,aspect=.77)
title = 'Causes-delays distributions for sampled flights data points \nof 15 min. or more arrival delays'
g.fig.suptitle(title + '\n (limiting to 150 min.)',
fontsize=18,color='#BF1B00',weight='bold')
g.map(sns.histplot,'Minutes',bins=35,kde=True)
#g.set_titles('{col_name}')
g.set_titles('')
xticks = [0,12,25,37,50,62,75,88,100,125,150]
g.set(xticks=xticks)
colors = (color for color in list(sns.color_palette('dark')))
for i in range(4):
    g.axes[0,i].set_xlabel(causes[i]+' delay (min.)',color=next(colors),fontsize=15)
    g.axes[0,i].grid(True)
g.set_ylabels('Freq.',color='blue',fontsize=15);
# bivariate bar plot for delay causes time means
# for all flights of 15 minutes or more arrival delays (1,524,733 rows)
plt.figure(figsize=(15,8))
sns.set(font_scale=1.19,style='white')
ax = sns.barplot(data = melt_15, x='Cause',y='Minutes', palette='pastel')
ax.set_box_aspect(5/len(ax.patches)) # change 5 (numerator) to modify the y/x axis ratio
sns.despine(left=True)
plt.title('Means of causes-delays \nfor the year 2008',fontsize=18,
color='#BF1B00',weight='bold',y=1.04)
plt.xlabel('Cause',fontsize=15)
plt.ylabel('Delay mean (min.)',fontsize=15)
#ytick = [0,20,40]
plt.yticks([])
#plt.grid(True)
# get the current tick locations and labels
locs, labels = plt.xticks()
# grouping melt_15 to calculate columns means by delay causes
# (numeric_only=True skips any non-numeric columns, which newer pandas versions no longer drop silently)
means_15 = melt_15.groupby('Cause').mean(numeric_only=True)
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct mean
    mean_15 = means_15.loc[label.get_text(),'Minutes']
    # print the annotation just above the top of the bar
    plt.text(loc+.1, mean_15+.2, s='{:0.2f} min.'.format(mean_15), ha = 'left', color = 'black',weight='bold')
ax.set_xticklabels(causes,weight='bold')
# coloring xticks labels
colors = [color for color in list(sns.color_palette('dark'))]
for xtick, color in zip(ax.get_xticklabels(), colors):
    xtick.set_color(color);
# univariate boxplots of 'delay-causes columns' for all flights data
plt.figure(figsize=(15,8))
plt.subplot(121)
sns.set(font_scale=1.19,style='ticks')
# main plot
ax = sns.boxplot(data= melt_15,
y='Minutes',x='Cause',)#palette='pastel')
plt.title('Causes-delays boxplots \nfor the year 2008 flights \nof 15 min. or more arrival delays',
fontsize=21,
color='#BF1B00',
weight='bold')
plt.ylabel('Delay (min.)',fontsize=18,color='blue')
plt.xlabel('Cause',fontsize=18,color='blue')
ax.set_xticklabels(causes)
plt.xticks(fontsize=15)
plt.yticks(np.arange(0,26,5)*100,
[0] + [str(n)+'k' for n in np.arange(0.5,2.6,0.5)],
fontsize=15)
sns.despine(left=True)
ax.grid(True,axis ='y')
# coloring xticks labels
colors = [color for color in list(sns.color_palette('dark'))]
for xtick, color in zip(ax.get_xticklabels(), colors):
    xtick.set_color(color);
# second plot for focusing on IQR rectangles and limiting arrival delay axis to 150 minutes
plt.subplot(122)
sns.set(font_scale=1.19,style='white')
ax = sns.boxplot(data= melt_15,
y='Minutes',x='Cause',palette='pastel')
sns.despine(bottom=True)
plt.title('Highlighting IQR rectangles \non the left boxplots',fontsize=21,color='#AD1FFF')
plt.xlabel('')
plt.ylabel('')
ax.set_xticks([])
plt.yticks(fontsize=18)
ax.set_yticks([0,9,13,18,25,41,57,65,89,126,150])
ax.set_ylim([0,150])
ax.grid(True,axis ='y');
plt.subplots_adjust(wspace=.2);
# univariate violinplots of 'delay-causes columns' for all flights data
plt.figure(figsize=(15,8))
plt.subplot(121)
sns.set(font_scale=1.19,style='ticks')
# main plot
ax = sns.violinplot(data= melt_15,
y='Minutes',x='Cause',palette='deep')
plt.title('Causes-delays violinplots \nfor the year 2008 flights \nof 15 min. or more delays',
fontsize=21,
color='#BF1B00',
weight='bold',y=1.04)
sns.despine(left=True)
plt.xlabel('Cause',fontsize=18,color='blue')
ax.set_xticklabels(causes)
plt.xticks(fontsize=14)
plt.ylabel('Delay (min.)',fontsize=18,color='blue')
plt.yticks(np.arange(0,26,5)*100,
[0] + [str(n)+'k' for n in np.arange(0.5,2.6,0.5)],
fontsize=15)
ax.grid(True,axis ='y')
# coloring xticks labels
colors = [color for color in list(sns.color_palette('dark'))]
for xtick, color in zip(ax.get_xticklabels(), colors):
    xtick.set_color(color);
# second plot for focusing on IQR rectangles and limiting arrival delay axis to 150 minutes # ------------
plt.subplot(122)
sns.set(font_scale=1.19,style='white')
ax = sns.violinplot(data= melt_15,
y='Minutes',x='Cause',palette='deep')
sns.despine(bottom=True)
plt.title('Highlighting the distributions around the medians \non the left violinplots',fontsize=21,color='#AD1FFF')
plt.xlabel('')
plt.ylabel('')
ax.set_xticks([])
ax.set_yticks([-13,0,9,13,18,25,41,57,65,89,126,150])
ax.set_ylim([-15,150])
plt.yticks(fontsize=18)
ax.grid(True,axis ='y');
plt.subplots_adjust(wspace=.2);
# plotting faceted univariate violinplots 'delay-causes columns' for flights sample data (500 rows)
g = sns.FacetGrid(data = melt_samp_15,
col = 'Cause',hue='Cause',
height=4.5,aspect=.9)
title = 'Causes-delays distributions \nfor the year 2008 flights sampled data points \nof 15 min. or more arrival delays'
g.fig.suptitle(title,
fontsize=21,
color='#BF1B00',
weight='bold',y=1.04)
sns.despine(left=True,bottom=True)
g.map(sns.violinplot,'Minutes')
# g.set_titles('{col_name}')
g.set_titles('')
xticks = [-60,0,100,200,300,400]
g.set(xticks=xticks)
g.set_xticklabels(xticks,fontsize=13.5)
g.set_ylabels('')
# coloring xticks labels
colors = (color for color in list(sns.color_palette('dark')))
for i in range(4):
    g.axes[0,i].set_xlabel(causes[i]+' delay (min.)',fontsize=18,color=next(colors))#sns.color_palette()[i])
    g.axes[0,i].grid(True);
# bivariate bar plot for arrival delay means by delay cause
# for all flights of 15 minutes or more arrival delays (1,524,733 rows)
plt.figure(figsize=(15,8))
sns.set(font_scale=1.19,style='ticks')
ax = sns.barplot(data = melt_15, x='Cause',y='ArrDelay',)# palette='pastel')
ax.set_box_aspect(5/len(ax.patches)) # change 5 (numerator) to modify the y/x axis ratio
sns.despine(left=True)
plt.title('Arrival delays (15 min. or more) means Vs. Causes \nfor the year 2008',fontsize=18,
color='#BF1B00',weight='bold',y=1.04)
plt.xlabel('Cause',fontsize = 14,weight='bold')
plt.ylabel('Arrival delay mean (min.)',fontsize = 14,weight='bold')
#ytick = [0,20,40,60,80]
#plt.yticks(ytick,ytick)
plt.yticks([])
#plt.grid(True)
# get the current tick locations and labels
locs, labels = plt.xticks()
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct mean
    mean_15 = means_15.loc[label.get_text(),'ArrDelay']
    # print the annotation just above the top of the bar
    plt.text(loc+.1, mean_15+.2, s='{:0.2f} min.'.format(mean_15), ha = 'left', color = 'black',weight='bold')
ax.set_xticklabels(causes,fontsize=13)
# coloring xticks labels
colors = [color for color in list(sns.color_palette('dark'))]
for xtick, color in zip(ax.get_xticklabels(), colors):
    xtick.set_color(color);
# boxplot of ArrDelay against delay causes
# for all data points df_inv_15 (1,524,733 rows)
plt.figure(figsize=(15,9))
plt.subplot(121)
sns.set(font_scale=1.19,style='ticks')
# main plot
ax = sns.boxplot(data= melt_15,
y='ArrDelay',
x='Cause',palette='bright')
sns.despine(left=True)
ax.grid(True,axis ='y')
plt.title('Arrival delays (15 min. or more) Vs. Delay causes \nfor the year 2008 flights',
fontsize=21,color='#BF1B00',weight='bold',y=1.04)
plt.ylabel('Arrival delay (min.)',fontsize=18,color='blue')
plt.xlabel('Delay cause',fontsize=18,color='blue')
ax.set_ylim([0,2500])
yticks=[0,100,200,400,2500]
plt.yticks(yticks,
[0,100,200,400,'2.5k'],
fontsize=15)
ax.set_xticklabels(causes)
plt.xticks(fontsize=14)
# coloring xticks labels
colors = [color for color in list(sns.color_palette('dark'))]
for xtick, color in zip(ax.get_xticklabels(), colors):
    xtick.set_color(color);
# second plot for focusing on IQR rectangles and limiting arrival delay axis to 150 minutes
plt.subplot(122)
sns.set(font_scale=1.19,style='white')
ax = sns.boxplot(data= melt_15,
y='ArrDelay',
x='Cause',palette='bright')
sns.despine(bottom=True)
plt.title('Highlighting IQR rectangles \non the left boxplots',fontsize=21,color='#AD1FFF')
plt.xlabel('',fontsize=18)
plt.ylabel('')
plt.yticks(fontsize=18)
ax.set_yticks([15,20,24,28,35,38,45,62,69,82,109,136,150])
ax.set_ylim([10,140])
ax.set_xticks([])
ax.grid(True,axis ='y');
plt.subplots_adjust(wspace=.2);
# bivariate violinplots of ArrDelay against delay causes
# for all flights data points df_inv_15 (1,524,733 rows)
plt.figure(figsize=(15,8))
plt.subplot(121)
sns.set(font_scale=1.19,style='ticks')
# main violinplot
ax = sns.violinplot(data= melt_15,
y='ArrDelay',x='Cause',palette='bright')
sns.despine(left=True)
ax.grid(True,axis ='y')
plt.title('Arrival delays (15 min. or more) Vs. Delay causes \nfor the year 2008 flights',
fontsize=21,color='#BF1B00',weight='bold',y=1.04)
plt.ylabel('Arrival delay (min.)',fontsize=18,color='blue')
plt.xlabel('Delay cause',fontsize=18,color='blue')
ax.set_ylim([0,2500])
yticks=[0,100,200,400,2500]
plt.yticks(yticks,
[0,100,200,400,'2.5k'],
fontsize=15)
ax.set_xticklabels(causes)
plt.xticks(fontsize=14)
# coloring xticks labels
colors = [color for color in list(sns.color_palette('dark'))]
for xtick, color in zip(ax.get_xticklabels(), colors):
    xtick.set_color(color);
# second plot for focusing on IQR rectangles and limiting arrival delay axis to 150 minutes
plt.subplot(122)
sns.set(font_scale=1.19,style='white')
ax = sns.violinplot(data= melt_15,
y='ArrDelay',
x='Cause',palette='bright')
sns.despine(bottom=True)
plt.title('Highlighting the distributions around the medians \non the left violinplots',fontsize=21,color='#AD1FFF')
plt.xlabel('',fontsize=18,weight='bold')
plt.ylabel('')
plt.yticks(fontsize=18,weight='bold')
ax.set_yticks([0,15,28,38,45,62,69,82,109,136,150,225])
ax.set_ylim([0,225])
ax.set_xticks([])
ax.grid(True,axis ='y');
plt.subplots_adjust(wspace=.2);
# Different-causes delays Vs. Arrival delays for all flights with 15+ min. arrival delay (polished)
sns.set(font_scale=1.19,style='ticks')
g = sns.PairGrid(melt_15, x_vars='Minutes',y_vars='ArrDelay',
hue="Cause",#palette=sns.color_palette(n_colors=4),
height=5,aspect=1)
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
g.add_legend()
plt.title('Different-causes delays Vs. Arrival delays\nfor all flights with 15+ min. arrival delay\n')
plt.xlabel('Cause-delays (min.)')
plt.ylabel('Arrival delays (min.)')
yticklabels = [str(n/1000) + 'k' for n in g.axes[0][0].get_yticks()]
yticklabels = [label.replace('0.0k','0') for label in yticklabels]
g.axes[0,0].set_yticklabels(labels = yticklabels)
xticklabels = [str(n/1000) + 'k' for n in g.axes[0][0].get_xticks()]
xticklabels = [label.replace('0.0k','0') for label in xticklabels]
g.axes[0,0].set_xticklabels(labels = xticklabels);
# Arrival Delay Vs. delay causes
# for the flight data points of 15 minutes ArrDelay or more (1,524,733 rows)
# limiting delays to 2500 minutes
sns.set(font_scale=1.19,style='ticks')
g = sns.PairGrid(data = df_inv_15[~zeros_mask],
x_vars = delay_causes_cols,
y_vars = 'ArrDelay',
despine=True,height=3.5)
title = 'Arrival delay (15 min. or more) Vs. delay causes delays \nfor the year 2008 flights'
plt.suptitle(title,fontsize=24,color='#BF1B00',weight='bold',y=1.15)
g.map(sns.regplot, scatter_kws={'alpha':1.85/10}, line_kws={"color": "#fe8589"});
g.set(xlim=(0,2500),ylim=(0,2500))
# labeling y axis, adjusting ytick labels
g.axes[0,0].set_ylabel('Arrival delay (min.)',color='blue',fontsize=16,weight='bold')
g.axes[0,0].set_yticks(ticks = np.arange(0,3,0.5)*1000)
yticklabels = [0] + [str(n/1000) + 'k' for n in g.axes[0][0].get_yticks()[1:]]
g.axes[0,0].set_yticklabels(labels = yticklabels,fontsize=15.5)
# labeling x axes, coloring x axes labels, adjusting xtick labels
colors = (color for color in list(sns.color_palette('dark')))
for i in range(4):
    g.axes[0][i].set_xlabel(causes[i]+ ' delay (min.)',color=next(colors),weight='bold',fontsize=16)#;sns.color_palette()[i],
    g.axes[0,i].set_xticks(ticks = np.arange(0,3,0.5)*1000)
    xticklabels = [0] + [str(n/1000) + 'k' for n in g.axes[0][i].get_xticks()[1:]]
    g.axes[0,i].set_xticklabels(xticklabels,fontsize=15.5);
# plots of Arrival Delay Vs. delay causes
# for the 500 flights data points sample df_inv_15_samp (500 rows)
# limiting the view to 300 min. of cause delay and 400 min. of arrival delay
#sns.set(font_scale=1.19,style='ticks')
g = sns.PairGrid(data = df_inv_15_samp[~samp_zeros_mask],
x_vars=delay_causes_cols,
y_vars = 'ArrDelay',
despine=True, height=4)
suptitle = 'Arrival delay (15 min. or more) Vs. delay causes delays \nfor the year 2008 flights sampled data points'
plt.suptitle(suptitle,fontsize=21,color='#BF1B00',weight='bold',y=1.15)
g.map(sns.regplot, scatter_kws={'alpha':1.85/10}, line_kws={"color": "orange"});
g.set(xlim=(0,300),ylim=(0,400))
g.axes[0,0].set_ylabel('Arrival delay (min.)',color='blue',weight='bold',fontsize=16)
for i in range(4):
    g.axes[0][i].set_xlabel(causes[i]+ ' delay (min.)',color=sns.color_palette('dark')[i],weight='bold',fontsize=16);
# plots of Arrival Delay Vs. delay causes
# for the 500 flights data points sample df_inv_15_samp (500 rows)
# limiting cause delays to 350 minutes
# with arrival delay log scale
sns.set(font_scale=1.19,style='ticks')
g = sns.PairGrid(data = df_inv_15_samp[~samp_zeros_mask],
x_vars=delay_causes_cols,
y_vars = 'ArrDelay',
despine=True, height=4)
title = 'Arrival delay (15 min. or more) Vs. delay causes delays \nfor the year 2008 flights sampled data points'
plt.suptitle(title+'\n (log scaled arrival delay)',fontsize=21,color='#BF1B00',weight='bold',y=1.2)
g.map(sns.regplot, scatter_kws={'alpha':1.85/10}, line_kws={"color": "orange"});
# log scale
g.set(yscale='log',xlim=(0,350))#,xscale='log');
g.axes[0,0].set_ylabel('Arrival delay (log min.)',color='blue',weight='bold',fontsize=16)
for i in range(4):
    g.axes[0][i].set_xlabel(causes[i]+ ' delay (min.) ',color=sns.color_palette('dark')[i],weight='bold',fontsize=16);
# plots of Arrival Delay Vs. delay causes
# for the 500 flights data points sample df_inv_15_samp (500 rows)
# with log scaled cause delay (no axis limits applied)
sns.set(font_scale=1.19,style='ticks')
g = sns.PairGrid(data = df_inv_15_samp[~samp_zeros_mask],
x_vars=delay_causes_cols,
y_vars = 'ArrDelay',
despine=True, height=4)
title = 'Arrival delay (15 min. or more) Vs. delay causes delays \nfor the year 2008 flights sampled data points'
plt.suptitle(title+'\n (log scaled cause delay)',fontsize=21,color='#BF1B00',weight='bold',y=1.2)
g.map(sns.regplot, scatter_kws={'alpha':1.85/10}, line_kws={"color": "orange"});
# log scale
g.set(xscale='log')#,yscale='log');
g.axes[0,0].set_ylabel('Arrival delay (min.)',color='blue',weight='bold',fontsize=16)
for i in range(4):
    g.axes[0][i].set_xlabel(causes[i]+ ' delay (log min.) ',color=sns.color_palette('dark')[i],weight='bold',fontsize=16);
# plots of Arrival Delay Vs. delay causes
# for the 500 flights data points sample df_inv_15_samp (500 rows)
# limiting cause delays to 1,000 minutes
# with log scaled axes
sns.set(font_scale=1.19,style='ticks')
g = sns.PairGrid(data = df_inv_15_samp[~samp_zeros_mask],
x_vars=delay_causes_cols,
y_vars = 'ArrDelay',
despine=True, height=4)
title = 'Arrival delay (15 min. or more) Vs. delay causes delays \nfor the year 2008 flights sampled data points'
plt.suptitle(title+'\n (log scaled axes)',fontsize=21,color='#BF1B00',weight='bold',y=1.2)
g.map(sns.regplot, scatter_kws={'alpha':1.85/10}, line_kws={"color": "orange"});
# log scale on both axes; a log axis cannot start at 0, so the lower x limit is 1
g.set(xscale='log',yscale='log',xlim=(1,10**3))
g.axes[0,0].set_ylabel('Arrival delay (log min.)',color='blue',weight='bold',fontsize=16)
for i in range(4):
    g.axes[0][i].set_xlabel(causes[i]+ ' delay (log min.) ',color=sns.color_palette('dark')[i],weight='bold',fontsize=16);
# correlation plot for ArrDelay and delay causes,
# for arrival delays of 15 minutes or more df_inv_15 (1,524,733 rows)
delays_cols = ['ArrDelay','CarrierDelay','WeatherDelay','NASDelay','SecurityDelay']
plt.figure(figsize = [10, 8])
ax = sns.heatmap(df_inv_15[~zeros_mask][delays_cols].corr(),
annot = True, fmt = '.3f',
cmap = 'vlag_r', center = 0)
plt.title('Arrival delays (15 min. or more) and delay causes delays correlation\nfor the year 2008 flights',
fontsize=21,color='#AD1FFF',y=1.03)
ticklabels=['Arrival delay','Carrier delay','Weather delay','National Air System delay','Security delay']
ax.set_xticklabels(labels=ticklabels,rotation=90,fontsize=13,weight='bold')
ax.set_yticklabels(labels=ticklabels,rotation=0,fontsize=13,weight='bold');
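The numbers annotated on the heatmap are Pearson correlation coefficients, since `DataFrame.corr()` uses `method='pearson'` by default. As a minimal sketch of what one cell of that matrix measures, the coefficient can be computed by hand; the helper name `pearson` and the two short lists are made-up illustration values, not rows from the flights data:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sum of products of deviations (numerator) and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# hypothetical toy values (minutes), only to show the call
arr_delay = [20, 45, 60, 90, 150]
carrier_delay = [0, 30, 15, 80, 120]
print(round(pearson(arr_delay, carrier_delay), 3))
```

A value near 1 means the two delays rise together almost linearly, a value near 0 means no linear relationship, and the diagonal of the heatmap is always 1 because each column is compared with itself.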
# month_counts.to_frame('Month').rename(columns={'Month':'Flights counts'}).sort_values('Flights counts',ascending=False)
# calculating flights counts and proportions for each month
# creating the 'prop_month' dataframe (used by the charts and tables below), sorted by total flights
prop_month = df_inv['Month'].value_counts().to_frame('Flights').sort_values('Flights',ascending=False)
prop_month['On-time'] = df_inv[(df_inv.ArrDelay < 15) & (~df_inv.ArrDelay.isna())]['Month'].value_counts()
prop_month['On-time prop (per month flights)'] = round(100 * prop_month['On-time'] / prop_month['Flights'],2)
prop_month['On-time prop (per month flights)'] = prop_month['On-time prop (per month flights)'].astype('str') + ' %'
prop_month['Delayed'] = df_inv_15['Month'].value_counts()
prop_month['Del. prop (per month flights)'] = round(100 * prop_month['Delayed'] / prop_month['Flights'],2)
prop_month['Del. prop (per month flights)'] = prop_month['Del. prop (per month flights)'].astype('str') + ' %'
prop_month['Cancelled'] = airline_cancelled['Month'].value_counts()
prop_month['Canc. prop (per month flights)'] = round(100 * prop_month['Cancelled'] / prop_month['Flights'],2)
prop_month['Canc. prop (per month flights)'] = prop_month['Canc. prop (per month flights)'].astype('str') + ' %'
prop_month['Diverted'] = prop_month['Flights'] - (prop_month['On-time']+prop_month['Delayed']+prop_month['Cancelled'])
prop_month['Div. prop (per month flights)'] = round(100 * prop_month['Diverted'] / prop_month['Flights'],2)
prop_month['Div. prop (per month flights)'] = prop_month['Div. prop (per month flights)'].astype('str') + ' %'
#### plotting clustered bar chart for flights counts month
colors = [sns.color_palette()[0] , sns.color_palette()[2] , sns.color_palette()[1] , sns.color_palette()[3]]
prop_month.iloc[:,[0,1,3,5]].plot(kind='bar',figsize=(15,9),rot=0, color=colors)
sns.set(font_scale=1.19,style='ticks')
sns.despine(left=True,bottom=True)
plt.title('Flights frequencies by months \nfor the year 2008 flights\n(sorted from highest total flights)',
fontsize=21,color='#BF1B00',weight='bold',y=1)
plt.xlabel('Month',fontsize=17,weight='bold')
plt.ylabel('Freq.',fontsize=17,weight='bold')
plt.xticks(fontsize=14,weight='bold')
plt.yticks(np.arange(0,7,1)*10**5,
[0]+[str(n)+'e5' for n in range(1,7,1)],
#[0,'1e5','2e5','3e5','4e5','5e5','6e5'],
fontsize=14,weight='bold')
plt.legend(labels=['Total','On-time','Delayed','Cancelled'],
#ncol=4,
loc='center right',bbox_to_anchor=(1.2,.4),#,bbox_to_anchor=(.5,1.047),
prop={'size':18})
plt.grid(True,axis='y')
# drawing max and min lines
max_min_lines = {'on-time':(480269,356100),
'delayed':(168647,71149),
'cancelled':(20596,3249)}
colors = (color for color in colors[1:])
for m in max_min_lines:
    color = next(colors)
    # named y_max / y_min so the built-in max() and min() are not shadowed
    y_max, y_min = max_min_lines[m]
    plt.axhline(y_max, color=color, linewidth=1.2, linestyle='-')
    plt.axhline(y_min, color=color, linewidth=1.2, linestyle='-');
#'-', '--', '-.', ':', 'None', ' ', '', 'solid', 'dashed', 'dashdot', 'dotted'
print('\nHint: Colored lines indicate the Max and Min frequencies')
Hint: Colored lines indicate the Max and Min frequencies
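The max and min values for the horizontal lines are typed in by hand above. A minimal sketch of deriving them instead, assuming the monthly counts from the year 2008 summary table printed further down in this notebook; the names `monthly_counts` and `max_min` are hypothetical helpers for illustration only:

```python
# Monthly counts copied from the 2008 summary table:
# month -> (on-time, delayed, cancelled)
monthly_counts = {
    'Jul': (475373, 140050, 10598),
    'Mar': (440973, 157370, 16183),
    'Aug': (480269, 120481, 9835),
    'Jun': (431198, 164260, 10931),
    'May': (479084, 120126, 6229),
    'Jan': (438323, 148807, 17308),
    'Apr': (464554, 122169, 10355),
    'Feb': (390686, 156235, 20596),
    'Oct': (478554, 73517, 3249),
    'Dec': (356100, 168647, 17779),
    'Sep': (459127, 71149, 9913),
    'Nov': (436051, 81922, 4458),
}

def max_min(counts, position):
    """Return (largest, smallest) monthly value for one category."""
    # sorted() is used so the sketch still works in a session where
    # the names max and min were rebound to numbers by an earlier cell
    ordered = sorted(row[position] for row in counts.values())
    return ordered[-1], ordered[0]

max_min_lines = {
    'on-time': max_min(monthly_counts, 0),
    'delayed': max_min(monthly_counts, 1),
    'cancelled': max_min(monthly_counts, 2),
}
print(max_min_lines)
```

With the `prop_month` DataFrame in memory, `prop_month['Cancelled'].max()` and `.min()` would give the same pair without copying any numbers.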
colors = [sns.color_palette('muted')[2],sns.color_palette('muted')[1],
sns.color_palette('muted')[3],sns.color_palette('muted')[-5]]
prop_month.iloc[:,[1,3,5,7]].plot(kind='bar',figsize=(15,9),rot=0,stacked=True, color=colors)
sns.set(font_scale=1.19,style='ticks')
sns.despine(left=True,bottom=True)
plt.title('Flights frequencies (stacked) by months \nfor the year 2008 flights\n(sorted from highest total flights)',
fontsize=22,color='#BF1B00',weight='bold',y=1.1)
plt.xlabel('Month',fontsize=17,weight='bold')
plt.ylabel('Freq.',fontsize=17,weight='bold')
plt.xticks(fontsize=14,weight='bold')
plt.yticks(np.arange(0,7,1)*10**5,
[0]+[str(n)+'e5' for n in range(1,7,1)],
#[0,'1e5','2e5','3e5','4e5','5e5','6e5'],
fontsize=14,weight='bold')
plt.legend(loc='upper center',bbox_to_anchor=(.5,1.07),
fontsize=13,framealpha =.9,
ncol=4,
#loc='center right',bbox_to_anchor=(1.2,.4),#,bbox_to_anchor=(.5,1.047),
prop={'size':20})
#plt.grid(True,axis='y')
#ax.xaxis.tick_top()
# drawing max and min lines
max_min_lines = {'on-time':(480269,356100),
'delayed':((168647 + 356100) , (71149+459127)) ,
'cancelled':((20596 + 156235 + 390686) , (3249 + 73517+478554))
}
color_gen = (color for color in colors)
for flights in max_min_lines:
    # named y_max / y_min so the built-in max() and min() are not shadowed
    y_max, y_min = max_min_lines[flights]
    color = next(color_gen)
    plt.axhline(y_max,color=color,xmin=.55,linewidth=1.4,linestyle ='-')
    plt.text(11.6,y_max,'Max value line',color=color,va='center')
    plt.axhline(y_min,color=color,xmin=.68,xmax=.899,linewidth=1.4,linestyle ='-')
    plt.text(10.35,y_min+3500,'Min value line',color='black',va='center')
print('\nHint: Colored lines indicate the Max and Min frequencies')
Hint: Colored lines indicate the Max and Min frequencies
# Months against flights counts
# for the whole data points df_inv (7,009,724 rows), against on_time (5,330,292 rows),
# df_inv_15 (1,524,733 rows) , airline_cancelled (137,434 rows)
fig , ax = plt.subplots(2,2,sharex=True,figsize=(15,18))
#ax = ax.flatten()
plt.suptitle('Months flights counts (from highest) and proportions \nfor the year 2008',
fontsize=22,color='#BF1B00',
#x=.45,
y=.955)#y=1.08)
# first subplot
# Total flights per months
ax[0,0] = plt.subplot(2,2,1)
month_counts = df_inv['Month'].value_counts()
sns.countplot(data = df_inv, y = 'Month',
order = month_counts.index,
color = sns.color_palette('rocket')[1],
#sns.color_palette('vlag')[-1],#sns.color_palette('rocket')[0],#default_color,
ax=ax[0,0])
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
plt.title('Totals \n[7,009,724 flights]',fontsize=21,color='#AD1FFF',y=1.023)#x=.45,y=1.0365)
plt.xlabel('')
plt.ylabel('Month',fontsize=18)#weight='bold',
plt.xticks([])
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # use the tick label text (the month) to look up that month's count
    count = month_counts[label.get_text()]
    percentage_total = 100*count/7009724
    string = str(f"{count:,}") + "\n{:0.2f}%".format(percentage_total) #+ ' flights'
    # print the annotation next to the bar
    plt.text(count+3000, loc, s= string, va = 'center', color = 'black');
#########
# second subplot
# on-time flights by Months
ax[0,1] = plt.subplot(2,2,2)
month_counts = on_time['Month'].value_counts()
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
sns.countplot(data = on_time, y = 'Month',
order = month_counts.index,
color = sns.color_palette('vlag')[0],#sns.color_palette('viridis')[-3],# 'skyblue',
ax=ax[0,1])
plt.title('On-time \n[5,330,292 flights]',fontsize=21,color='#AD1FFF',y=1.023)#,x=.25
plt.xlabel('')
plt.ylabel('')
#plt.ylabel('Month',weight='bold',fontsize=18)
plt.xticks([])
plt.yticks(color = default_color)
# total flights per month, used to express each month's on-time count
# as a share of that month's flights
month_flights_count_whole = df_inv['Month'].value_counts()
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # use the tick label text (the month) to look up that month's counts
    count = month_counts[label.get_text()]
    count_whole = month_flights_count_whole[label.get_text()]
    percentage_value = 100*count/count_whole
    percentage_total = 100*count/5330292
    string = str(f"{count:,}") + " ({:0.2f}% of month's)\n{:0.2f}%".format(percentage_value,percentage_total)# flights
    # print the annotation next to the bar
    plt.text(count+3000, loc, s=string, va = 'center', color = 'black');
##################################
# third subplot
# delayed flights by Months
ax[1,0] = plt.subplot(2,2,3)
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
month_counts = df_inv_15['Month'].value_counts()
sns.countplot(data = df_inv_15, y = 'Month',
order = month_counts.index,
color = 'skyblue',
ax=ax[1,0])
plt.title('Delayed\n(15 min. arrival delay or more)\n[1,524,733 flights] ',fontsize=21,color='#AD1FFF',y=1.023)
#,x=.45,y=1.0365)
#,y=1.023)#,x=.25
plt.xlabel('')
plt.ylabel('')
plt.ylabel('Month',fontsize=18)#weight='bold',
plt.xticks([])
plt.yticks(color = 'brown')
# total flights per month, used to express each month's count of delays
# of 15 minutes or more as a share of that month's flights
month_flights_count_whole = df_inv['Month'].value_counts()
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # use the tick label text (the month) to look up that month's counts
    count = month_counts[label.get_text()]
    count_whole = month_flights_count_whole[label.get_text()]
    percentage_value = 100*count/count_whole
    percentage_total = 100*count/1524733
    string = str(f"{count:,}") + " ({:0.2f}% of month's)\n{:0.2f}%".format(percentage_value,percentage_total)# flights
    # print the annotation next to the bar
    plt.text(count+3000, loc, s=string, va = 'center', color = 'black');
#################
# fourth subplot
# cancelled flights by Months
ax[1,1] = plt.subplot(2,2,4)
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
month_counts = airline_cancelled['Month'].value_counts()
sns.countplot(data = airline_cancelled ,
y = 'Month',
order = month_counts.index,
color = sns.color_palette()[1],
ax=ax[1,1])
plt.title('Cancelled \n[137,434 flights]',fontsize=21,color='#AD1FFF',y=1.0365)#x=.45,y=1.0365)#x=.205,y=1.0365)
plt.xlabel('')
plt.ylabel('')
#plt.ylabel('Month',weight='bold',fontsize=18)
plt.xticks([])
plt.yticks(color = sns.color_palette()[1],fontsize=12.5)
# total flights per month, used to express each month's cancelled count
# as a share of that month's flights
month_flights_count_whole = df_inv['Month'].value_counts()
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # use the tick label text (the month) to look up that month's counts
    count = month_counts[label.get_text()]
    count_whole = month_flights_count_whole[label.get_text()]
    percentage_value = 100*count/count_whole
    percentage_total = 100*count/137434
    string = str(f"{count:,}") + " ({:0.2f}% of month's)\n{:0.2f}%".format(percentage_value,percentage_total)# flights
    # print the annotation next to the bar
    plt.text(count+3000, loc, s=string, va = 'center', color = 'black');#,fontsize=12.5);
#########
plt.subplots_adjust(wspace=0.35);
# Months against flights counts
# for the whole data points df_inv (7,009,724 rows),
# against df_inv_15 (1,524,733 rows) , airline_cancelled (137,434 rows)
default_color = sns.color_palette()[0]
fig, ax = plt.subplots(1, 3, sharex=True,figsize=(15,10))
plt.suptitle('Months flights counts (from highest) and proportions \nfor the year 2008',
fontsize=22,color='#BF1B00',
#x=.45,
y=1.08)
# first subplot
ax[0] = plt.subplot(1,3,1)
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
month_counts = airline_cancelled['Month'].value_counts()
sns.countplot(data = airline_cancelled ,
y = 'Month',
order = month_counts.index,
color = sns.color_palette()[1],
ax=ax[0])
plt.title('Cancelled \n[137,434 flights]',fontsize=21,color='#AD1FFF',x=.205,y=1.0365)
plt.xlabel('')
plt.ylabel('Month',weight='bold',fontsize=18)
plt.xticks([])
plt.yticks(color = sns.color_palette()[1],fontsize=12.5)
# total flights per month, used to express each month's cancelled count
# as a share of that month's flights
month_flights_count_whole = df_inv['Month'].value_counts()
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # use the tick label text (the month) to look up that month's counts
    count = month_counts[label.get_text()]
    count_whole = month_flights_count_whole[label.get_text()]
    percentage_value = 100*count/count_whole
    percentage_total = 100*count/137434
    string = str(f"{count:,}") + " ({:0.2f}% of month's)\n{:0.2f}%".format(percentage_value,percentage_total)# flights
    # print the annotation next to the bar
    plt.text(count+3000, loc, s=string, va = 'center', color = 'black');#,fontsize=12.5);
##################################
# second subplot
ax[1] = plt.subplot(1,3,2)
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
month_counts = df_inv_15['Month'].value_counts()
sns.countplot(data = df_inv_15, y = 'Month',
order = month_counts.index,
color = 'skyblue',
ax=ax[1])
plt.title('Delayed \n[1,524,733 flights] \n(15 min. arrival delay or more)',fontsize=21,color='#AD1FFF',y=1.023)#,x=.25
plt.xlabel('')
plt.ylabel('')
#plt.ylabel('Carrier',weight='bold',fontsize=18)
plt.xticks([])
plt.yticks(color = 'brown')
# total flights per month, used to express each month's count of delays
# of 15 minutes or more as a share of that month's flights
month_flights_count_whole = df_inv['Month'].value_counts()
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # use the tick label text (the month) to look up that month's counts
    count = month_counts[label.get_text()]
    count_whole = month_flights_count_whole[label.get_text()]
    percentage_value = 100*count/count_whole
    percentage_total = 100*count/1524733
    string = str(f"{count:,}") + " ({:0.2f}% of month's)\n{:0.2f}%".format(percentage_value,percentage_total)# flights
    # print the annotation next to the bar
    plt.text(count+3000, loc, s=string, va = 'center', color = 'black');
##################################
# third subplot
ax[2] = plt.subplot(1,3,3)
month_counts = df_inv['Month'].value_counts()
sns.countplot(data = df_inv, y = 'Month',
order = month_counts.index,
color = default_color,
ax=ax[2])
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
plt.title('Totals \n[7,009,724 flights]',fontsize=21,color='#AD1FFF',x=.45,y=1.0365)
plt.xlabel('')
plt.ylabel('')
plt.xticks([])
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # use the tick label text (the month) to look up that month's count
    count = month_counts[label.get_text()]
    percentage_total = 100*count/7009724
    string = str(f"{count:,}") + "\n{:0.2f}%".format(percentage_total) #+ ' flights'
    # print the annotation next to the bar
    plt.text(count+3000, loc, s= string, va = 'center', color = 'black');
#########
plt.subplots_adjust(wspace=0.35);
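The annotation loops in the month subplots all build the same kind of label: the count with thousands separators, optionally its share of the month's flights, then its share of the group total. A minimal sketch of factoring that string building into one function; `bar_label` is a hypothetical helper name, and the example numbers are the July figures from this notebook's tables:

```python
def bar_label(count, group_total, month_total=None):
    """Build the text printed next to a bar.

    count        number of flights the bar represents
    group_total  size of the group being plotted (e.g. all on-time flights)
    month_total  all flights of that month; omit for the 'Totals' panel
    """
    label = f"{count:,}"
    if month_total is not None:
        label += f" ({100 * count / month_total:0.2f}% of month's)"
    return label + f"\n{100 * count / group_total:0.2f}%"

# July on-time flights: share of July's flights, then share of all on-time flights
print(bar_label(475373, 5330292, month_total=627931))
# July total flights: share of the year's flights only
print(bar_label(627931, 7009724))
```

Inside each loop the call would then replace the `string = ...` line, for example `plt.text(count+3000, loc, s=bar_label(count, 5330292, count_whole), va='center')`.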
# Flights proportions by Months donut plots
plt.subplots(2,2)#,sharex=True,sharey=True)
plt.suptitle("Flights proportions \nby Months \nfor the year 2008",color='#BF1B00',fontsize=33,weight='bold', y=1)
plt.subplot(221)
# Total flights proportions by Months
plt.title("All \n[7,009,724 flights]",color='#AD1FFF',y=1.068, fontsize=30,weight='bold')
month_counts = df_inv['Month'].value_counts()
month_counts.plot(kind='pie',figsize=(23,20),pctdistance=1.073, labeldistance=1.197,wedgeprops = {'width' : .25},#radius=7.5,
label='',#labels=n_list,
counterclock=False,startangle=90,
#explode=(0.1, 0.1, 0.1, 0.1),
autopct="%1.2f%%", colors=sns.color_palette(),fontsize=20)
plt.axis("equal")
plt.subplot(222)
# Total on-time flights proportions by Months
plt.title("On-time \n[5,330,292 flights]" ,color='#AD1FFF', y=1.068,fontsize=28,weight='bold')
month_counts = on_time['Month'].value_counts()
month_counts.plot(kind='pie',figsize=(23,20),pctdistance=1.073, labeldistance=1.2,wedgeprops = {'width' : .25},##radius=7.5,
label='',#labels=n_list,
counterclock=False,startangle=90,
#explode=(0.1, 0.1, 0.1, 0.1),
autopct="%1.2f%%", colors=sns.color_palette(),fontsize=20)
plt.axis("equal");
plt.subplot(223)
# Total delayed flights proportions by Months
plt.title("Delayed (15 min. arrival delay or more) \n[1,524,733 flights]", color='#AD1FFF',y=1.068, fontsize=28,weight='bold')
month_counts = df_inv_15['Month'].value_counts()
month_counts.plot(kind='pie',figsize=(23,20),pctdistance=1.073, labeldistance=1.2,wedgeprops = {'width' : .25},##radius=7.5,
label='',#labels=n_list,
counterclock=False,startangle=90,
#explode=(0.1, 0.1, 0.1, 0.1),
autopct="%1.2f%%", colors=sns.color_palette(),fontsize=20)
plt.axis("equal")
plt.subplot(224)
# Total cancelled flights proportions by Months
plt.title("Cancelled \n[137,434 flights]" ,color='#AD1FFF', y=1.068,fontsize=28,weight='bold')
month_counts = airline_cancelled['Month'].value_counts()
month_counts.plot(kind='pie',figsize=(23,20),pctdistance=1.073, labeldistance=1.2,wedgeprops = {'width' : .25},##radius=7.5,
label='',#labels=n_list,
counterclock=False,startangle=90,
#explode=(0.1, 0.1, 0.1, 0.1),
autopct="%1.2f%%", colors=sns.color_palette(),fontsize=20)
plt.axis("equal")
plt.subplots_adjust(wspace=.15,hspace=.27);
print('\nYear 2008 flights aggregated counts and proportions per Months\n\n(sorted from the highest total flights of month)')
prop_month
# rows are sorted by each month's total number of flights
Year 2008 flights aggregated counts and proportions per Months (sorted from the highest total flights of month)
| Month | Flights | On-time | On-time prop (per month flights) | Delayed | Del. prop (per month flights) | Cancelled | Canc. prop (per month flights) | Diverted | Div. prop (per month flights) |
|---|---|---|---|---|---|---|---|---|---|
| Jul | 627931 | 475373 | 75.7 % | 140050 | 22.3 % | 10598 | 1.69 % | 1910 | 0.3 % |
| Mar | 616090 | 440973 | 71.58 % | 157370 | 25.54 % | 16183 | 2.63 % | 1564 | 0.25 % |
| Aug | 612279 | 480269 | 78.44 % | 120481 | 19.68 % | 9835 | 1.61 % | 1694 | 0.28 % |
| Jun | 608665 | 431198 | 70.84 % | 164260 | 26.99 % | 10931 | 1.8 % | 2276 | 0.37 % |
| May | 606293 | 479084 | 79.02 % | 120126 | 19.81 % | 6229 | 1.03 % | 854 | 0.14 % |
| Jan | 605765 | 438323 | 72.36 % | 148807 | 24.57 % | 17308 | 2.86 % | 1327 | 0.22 % |
| Apr | 598126 | 464554 | 77.67 % | 122169 | 20.43 % | 10355 | 1.73 % | 1048 | 0.18 % |
| Feb | 569232 | 390686 | 68.63 % | 156235 | 27.45 % | 20596 | 3.62 % | 1715 | 0.3 % |
| Oct | 556205 | 478554 | 86.04 % | 73517 | 13.22 % | 3249 | 0.58 % | 885 | 0.16 % |
| Dec | 544958 | 356100 | 65.34 % | 168647 | 30.95 % | 17779 | 3.26 % | 2432 | 0.45 % |
| Sep | 540908 | 459127 | 84.88 % | 71149 | 13.15 % | 9913 | 1.83 % | 719 | 0.13 % |
| Nov | 523272 | 436051 | 83.33 % | 81922 | 15.66 % | 4458 | 0.85 % | 841 | 0.16 % |
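Each row of the table follows the same arithmetic: diverted flights are whatever remains after on-time, delayed and cancelled flights are subtracted from the month's total, and every proportion is the count divided by that total. A minimal sketch for a single month, using the July counts from the table; `month_breakdown` is a hypothetical helper name, and the percentages are kept numeric rather than converted to strings:

```python
def month_breakdown(flights, on_time, delayed, cancelled):
    """Return {category: (count, percent of the month's flights)}."""
    # diverted flights are the remainder of the month's total
    diverted = flights - (on_time + delayed + cancelled)
    counts = {
        'On-time': on_time,
        'Delayed': delayed,
        'Cancelled': cancelled,
        'Diverted': diverted,
    }
    return {name: (n, round(100 * n / flights, 2)) for name, n in counts.items()}

# July 2008 counts from the table above
july = month_breakdown(flights=627931, on_time=475373, delayed=140050, cancelled=10598)
for name, (n, pct) in july.items():
    print(f"{name:<10}{n:>8,}{pct:>7} %")
```

Keeping the proportions as numbers and adding the `' %'` suffix only when displaying them would also let the table be sorted by a proportion column.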
# Total flights proportions by Months
month_counts = df_inv['Month'].value_counts()
month_counts.plot(kind='pie',figsize=(10,10),radius=1.5,
label='',#labels=n_list,
counterclock=False,startangle=90,
#explode=(0.1, 0.1, 0.1, 0.1),
autopct="%1.2f%%", colors=sns.color_palette(),fontsize=18)
plt.title("Total flights proportions by Months \nfor the year 2008\n(Total 7,009,724)",y=1.04, color='#BF1B00', fontsize=24)
plt.axis("equal");
# On-time flights proportions by Months
month_counts = on_time['Month'].value_counts()
month_counts.plot(kind='pie',figsize=(10,10),radius=1.5,
label='',#labels=n_list,
counterclock=False,startangle=90,
#explode=(0.1, 0.1, 0.1, 0.1),
autopct="%1.2f%%", colors=sns.color_palette(),fontsize=18)
plt.title("On-time flights proportions per Months \nfor the year 2008\n(Total 5,330,292)",y=1.04, color='#BF1B00', fontsize=24)
plt.axis("equal");
# Total delayed flights proportions by Months
month_counts = df_inv_15['Month'].value_counts()
month_counts.plot(kind='pie',figsize=(10,10),radius=1.5,
label='',#labels=n_list,
counterclock=False,startangle=90,
#explode=(0.1, 0.1, 0.1, 0.1),
autopct="%1.2f%%", colors=sns.color_palette(),fontsize=18)
plt.title("Delayed flights (15 min. arrival delay or more) proportions per Months \nfor the year 2008\n(Total 1,524,733)",
y=1.04, color='#BF1B00', fontsize=24)
plt.axis("equal");
# Total cancelled flights proportions by Months
month_counts = airline_cancelled['Month'].value_counts()
month_counts.plot(kind='pie',figsize=(10,10),radius=1.5,
label='',#labels=n_list,
counterclock=False,startangle=90,
#explode=(0.1, 0.1, 0.1, 0.1),
#pctdistance=.8,
autopct="%1.2f%%", colors=sns.color_palette(),fontsize=18)
plt.title("Cancelled flights proportions per months \nfor the year 2008\n(Total 137,434)",y=1.04, color='#BF1B00', fontsize=24)
plt.axis("equal");
# Months Vs. flight cancellation times
plt.figure(figsize=(18,8))
ax = sns.countplot(data=airline_cancelled,
x='Month',# order = airline_cancelled['Month'].value_counts().index,
#hue='CancellationCode',hue_order=['A','B','C','D'],
)#palette='pastel');
sns.despine()
title = 'Flights-cancellation frequencies per months \nfor the year 2008 flights'
plt.title(title + '\n(sorted upon months of the year)',
fontsize=21,color='#BF1B00',weight='bold',y=1.04)
plt.xlabel('Month',fontsize=18,weight='bold')
plt.ylabel('Freq.',fontsize=18,weight='bold')
plt.xticks(fontsize=13.5,weight='bold')
plt.yticks(np.arange(0,23,2)*1000,
[0]+[str(int(n))+'k' for n in np.arange(2,23,2)],
#[0,'2k','4k','6k','8k','10k','12k','14k','16k','18k','20k','22k'],
fontsize=13.5,weight='bold')
#plt.grid(True,axis='y')
months_canc = airline_cancelled['Month'].value_counts()
# get the current tick locations and labels
locs, labels = plt.xticks()
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # use the tick label text (the month) to look up that month's cancelled count
    count = months_canc[label.get_text()]
    percentage_total = 100*count/137434
    string = "{:0.2f}%\n{:,}".format(percentage_total,count) #+ str(f"{count:,}")
    # print the annotation above the bar
    plt.text(loc, count+300, s=string, ha = 'center', color = 'black',fontsize=13.5,weight='bold');
# Months Vs. flight cancellation times
plt.figure(figsize=(18,8))
ax = sns.countplot(data=airline_cancelled,
x='Month', order = airline_cancelled['Month'].value_counts().index,
#hue='CancellationCode',hue_order=['A','B','C','D'],
)#palette='pastel');
sns.despine()
plt.title('Flights-cancellation frequencies by months \nfor the year 2008 flights\n(sorted from highest)',
fontsize=21,color='#BF1B00',weight='bold',y=1.04)
plt.xlabel('Month',fontsize=18,weight='bold')
plt.ylabel('Freq.',fontsize=18,weight='bold')
plt.xticks(fontsize=13.5,weight='bold')
plt.yticks(np.arange(0,23,2) * 1000,
[0]+[str(n)+'k' for n in np.arange(2,23,2)],
#[0,'2k','4k','6k','8k','10k','12k','14k','16k','18k','20k','22k'],
fontsize=13.5,weight='bold')
#plt.grid(True,axis='y')
months_canc = airline_cancelled['Month'].value_counts()
# get the current tick locations and labels
locs, labels = plt.xticks()
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # use the tick label text (the month) to look up that month's cancelled count
    count = months_canc[label.get_text()]
    percentage_total = 100*count/137434
    string = "{:0.2f}%\n{:,}".format(percentage_total,count) #+ str(f"{count:,}")
    # print the annotation above the bar
    plt.text(loc, count+300, s=string, ha = 'center', color = 'black',fontsize=13.5,weight='bold');
print('\nYear 2008 cancelled flights aggregated data per Months\n\n(sorted from the highest total flights per month)')
prop_month.sort_values('Flights',ascending=False).iloc[:,[0,5,6]]
# rows are sorted by each month's total number of flights
Year 2008 cancelled flights aggregated data per Months (sorted from the highest total flights per month)
| Month | Flights | Cancelled | Canc. prop (per month flights) |
|---|---|---|---|
| Jul | 627931 | 10598 | 1.69 % |
| Mar | 616090 | 16183 | 2.63 % |
| Aug | 612279 | 9835 | 1.61 % |
| Jun | 608665 | 10931 | 1.8 % |
| May | 606293 | 6229 | 1.03 % |
| Jan | 605765 | 17308 | 2.86 % |
| Apr | 598126 | 10355 | 1.73 % |
| Feb | 569232 | 20596 | 3.62 % |
| Oct | 556205 | 3249 | 0.58 % |
| Dec | 544958 | 17779 | 3.26 % |
| Sep | 540908 | 9913 | 1.83 % |
| Nov | 523272 | 4458 | 0.85 % |
#prop['Canc. prop (per total cancelled-flights 137,434 of year 2008'] = round(prop['Cancelled']*100/137434 ,
# 2).astype(str) + ' %'
#prop.drop(columns='Canc. prop (per total cancelled-flights 137,434 of year 2008',inplace=True)
print('\nMonths cancelled-flights proportions compared to all 137,434 total cancelled-flights of year 2008 \n\n\
(sorted from the highest proportion)')
col = 'Canc. prop (per total 137,434 cancelled-flights of year 2008)'
round(prop_month.sort_values('Flights',ascending=False)['Cancelled']*100/137434 ,2) \
.to_frame(col).sort_values(col,ascending=False).astype(str) + ' %'
# rows are sorted by each month's share of the year's cancelled flights
Months cancelled-flights proportions compared to all 137,434 total cancelled-flights of year 2008 (sorted from the highest proportion)
| Month | Canc. prop (per total 137,434 cancelled-flights of year 2008) |
|---|---|
| Feb | 14.99 % |
| Dec | 12.94 % |
| Jan | 12.59 % |
| Mar | 11.78 % |
| Jun | 7.95 % |
| Jul | 7.71 % |
| Apr | 7.53 % |
| Sep | 7.21 % |
| Aug | 7.16 % |
| May | 4.53 % |
| Nov | 3.24 % |
| Oct | 2.36 % |
# Months Vs. flight cancellation causes frequencies
plt.figure(figsize=(20,10))
ax = sns.countplot(data=airline_cancelled,
x='Month',
hue='CancellationCode',hue_order=['A','B','C','D'],
palette='bright')
sns.despine()
title = 'Flights cancellation-causes frequencies per months \nfor the year 2008'
plt.title(title + '\n(sorted upon months order within the year)',
fontsize=21,color='#BF1B00',weight='bold',y=1.14)
plt.xlabel('Months',fontsize=18,weight='bold')
plt.ylabel('Freq.',fontsize=18,weight='bold')
plt.xticks(fontsize=16,weight='bold')
plt.yticks(np.arange(0,12000,2000),
[0] + [str(n)+'k' for n in range(2,11,2)],
#[0,'2k','4k','6k','8k','10k'],
fontsize=16,weight='bold')
legend = plt.legend(title='Causes',
title_fontsize=18,ncol=4,loc='upper center',bbox_to_anchor=(.5,1.15),
labels=['Carrier','Weather','National Air System','Security'],
fontsize=13,prop={'size':20},framealpha =.9)
plt.setp(legend.get_title(),fontsize=18,weight='bold')
# drawing max and min lines
max_min_lines = {'carrier':(7312,2097),
'weather':(10042,600),
'NAS':(4470,552)}
colors = (color for color in list(sns.color_palette('bright')))
for m in max_min_lines:
    # named y_max / y_min so the built-in max() and min() are not shadowed
    y_max, y_min = max_min_lines[m]
    color = next(colors)
    plt.axhline(y_max, color=color, linewidth=1, linestyle='-')
    plt.axhline(y_min, color=color, linewidth=1, linestyle='-');
#'-', '--', '-.', ':', 'None', ' ', '', 'solid', 'dashed', 'dashdot', 'dotted'
print('\nHint: Colored lines indicate the Max and Min frequencies')
#plt.grid(True,axis='y');
Hint: Colored lines indicate the Max and Min frequencies
# flights cancellation-causes times per months
print('\nFlights cancellation-causes times per months\n\n(Sorted from highest total cancellation times)')
airline_cancelled_dummies = pd.get_dummies(data=airline_cancelled,
columns=['CancellationCode'],
prefix='',
prefix_sep='')
#airline_cancelled_dummies.drop(columns='Month',inplace=True)
airline_cancelled_dummies['Total flights-cancellation causes times'] = \
airline_cancelled_dummies['A'] + airline_cancelled_dummies['B'] + \
airline_cancelled_dummies['C'] + airline_cancelled_dummies['D']
airline_cancelled_dummies.rename(columns={'A':'Carrier cancellation',
'B':'Weather cancellation',
'C':'National Air system cancellation',
'D':'Security cancellation'},
inplace=True)
airline_cancelled_dummies.groupby('Month').sum()\
.sort_values(by='Total flights-cancellation causes times',ascending=False)\
#.iloc[:,[0,1,2,3]]\
#.plot(kind='barh',figsize=(15,15));
Flights cancellation-causes times per months (Sorted from highest total cancellation times)
| Month | Carrier cancellation | Weather cancellation | National Air system cancellation | Security cancellation | Total flights-cancellation causes times |
|---|---|---|---|---|---|
| Feb | 6090.0 | 10034.0 | 4470.0 | 2.0 | 20596.0 |
| Dec | 4850.0 | 10042.0 | 2887.0 | 0.0 | 17779.0 |
| Jan | 6635.0 | 6711.0 | 3962.0 | 0.0 | 17308.0 |
| Mar | 6038.0 | 7520.0 | 2621.0 | 4.0 | 16183.0 |
| Jun | 4251.0 | 3448.0 | 3232.0 | 0.0 | 10931.0 |
| Jul | 5292.0 | 2612.0 | 2694.0 | 0.0 | 10598.0 |
| Apr | 7312.0 | 1479.0 | 1564.0 | 0.0 | 10355.0 |
| Sep | 2246.0 | 6478.0 | 1188.0 | 1.0 | 9913.0 |
| Aug | 3852.0 | 3582.0 | 2401.0 | 0.0 | 9835.0 |
| May | 3157.0 | 1461.0 | 1608.0 | 3.0 | 6229.0 |
| Nov | 2510.0 | 937.0 | 1009.0 | 2.0 | 4458.0 |
| Oct | 2097.0 | 600.0 | 552.0 | 0.0 | 3249.0 |
#{'A':'Carrier cancellation',
# 'B':'Weather cancellation',
# 'C':'National Air system cancellation',
# 'D':'Security cancellation'},
print('\n','Cancellation Codes: A = Carrier | B = Weather | C = National Air system | D = Security')
airline_cancelled.groupby(['Month','CancellationCode']).size().to_frame('Freq.')
Cancellation Codes: A = Carrier | B = Weather | C = National Air system | D = Security
| Month | CancellationCode | Freq. |
|---|---|---|
| Jan | A | 6635 |
| | B | 6711 |
| | C | 3962 |
| | D | 0 |
| Feb | A | 6090 |
| | B | 10034 |
| | C | 4470 |
| | D | 2 |
| Mar | A | 6038 |
| | B | 7520 |
| | C | 2621 |
| | D | 4 |
| Apr | A | 7312 |
| | B | 1479 |
| | C | 1564 |
| | D | 0 |
| May | A | 3157 |
| | B | 1461 |
| | C | 1608 |
| | D | 3 |
| Jun | A | 4251 |
| | B | 3448 |
| | C | 3232 |
| | D | 0 |
| Jul | A | 5292 |
| | B | 2612 |
| | C | 2694 |
| | D | 0 |
| Aug | A | 3852 |
| | B | 3582 |
| | C | 2401 |
| | D | 0 |
| Sep | A | 2246 |
| | B | 6478 |
| | C | 1188 |
| | D | 1 |
| Oct | A | 2097 |
| | B | 600 |
| | C | 552 |
| | D | 0 |
| Nov | A | 2510 |
| | B | 937 |
| | C | 1009 |
| | D | 2 |
| Dec | A | 4850 |
| | B | 10042 |
| | C | 2887 |
| | D | 0 |
# plotting Month Vs ArrDelay means
month_arr_means = df_inv.loc[~(df_inv['ArrDelay'].isin(zero_or_null)),['Month','ArrDelay']] \
.groupby('Month').mean() #.sort_values(by='ArrDelay',ascending=False)
plt.figure(figsize=[15,8])
#sns.pointplot(data=month_arr_means, x=month_arr_means.index , y='ArrDelay',
# order=month_arr_means.index,
# linestyles='',color=sns.color_palette()[0])
#plt.errorbar( x=month_arr_means.index , y=month_arr_means['ArrDelay'],
# #order=month_arr_means.index,
# #linestyles='',
# )#color=sns.color_palette()[0])
sns.pointplot( x=month_arr_means.index , y=month_arr_means['ArrDelay'],
#order=month_arr_means.index,
linestyles='dotted',
)#color=sns.color_palette()[0])
sns.despine(left=True)
sns.set_theme(font_scale=1.19,style='ticks')
plt.title('Arrival delays (15 min. or more) means \nper months \n for the year 2008 flights',
fontsize=21,weight='bold',y=1.02)
plt.xlabel('Month',fontsize=18,weight='bold')
plt.ylabel('Arrival delay mean (min.)',fontsize=18,weight='bold')
plt.ylim((0,18))
plt.grid(True,axis='both')
plt.xticks(fontsize=14,weight='bold')
# get the current tick locations and labels
locs, labels = plt.xticks(fontsize=13.75,weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # look up the month's mean from the tick label text
    mean = month_arr_means.loc[label.get_text(),'ArrDelay']
    string = '{:0.2f} min.'.format(mean)
    # print the annotation just above the point
    plt.text(loc, mean+.33, s=string, ha = 'center', color = 'black',fontsize=13.7,weight='bold');
# plotting Month Vs ArrDelay means
month_arr_means = df_inv.loc[~(df_inv['ArrDelay'].isin(zero_or_null)),['Month','ArrDelay']] \
.groupby('Month').mean() \
.sort_values(by='ArrDelay',ascending=False)
plt.figure(figsize=[15,8])
#sns.pointplot(data=month_arr_means, x=month_arr_means.index , y='ArrDelay',
# order=month_arr_means.index,
# linestyles='',color=sns.color_palette()[0])
#plt.errorbar( x=month_arr_means.index , y=month_arr_means['ArrDelay'],
# #order=month_arr_means.index,
# #linestyles='',
# )#color=sns.color_palette()[0])
sns.pointplot( x=month_arr_means.index , y=month_arr_means['ArrDelay'],
order=month_arr_means.index,
linestyles='',
)#color=sns.color_palette()[0])
sns.despine(left=True)
sns.set_theme(font_scale=1.19,style='ticks')
plt.title('Arrival delays (15 min. or more) means per months \n for the year 2008 flights\n(sorted from highest mean)',
fontsize=21,weight='bold',y=1.02)
plt.xlabel('Month',fontsize=18,weight='bold')
plt.ylabel('Arrival delay mean (min.)',fontsize=18,weight='bold')
plt.ylim((0,18))
plt.grid(True,axis='both')
plt.xticks(fontsize=14,weight='bold')
# get the current tick locations and labels
locs, labels = plt.xticks(fontsize=13.75,weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # look up the month's mean from the tick label text
    mean = month_arr_means.loc[label.get_text(),'ArrDelay']
    string = '{:0.2f} min.'.format(mean)
    # print the annotation just above the point
    plt.text(loc, mean+.33, s=string, ha = 'center', color = 'black',fontsize=13.7,weight='bold');
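The two cells above compute the same per-month means and differ only in row order (calendar order versus highest mean first). A minimal sketch of how both orders could be obtained from one group-by, assuming `Month` holds three-letter month names; `toy` and `month_names` are illustrative and not part of the notebook:

```python
import pandas as pd

month_names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Toy stand-in for the filtered df_inv rows (hypothetical values).
toy = pd.DataFrame({
    "Month": ["Mar", "Jan", "Feb", "Jan"],
    "ArrDelay": [10.0, 20.0, 50.0, 40.0],
})

# An ordered categorical makes groupby return months in calendar order
# instead of alphabetical order.
toy["Month"] = pd.Categorical(toy["Month"], categories=month_names, ordered=True)

chrono = toy.groupby("Month", observed=True)["ArrDelay"].mean()
ranked = chrono.sort_values(ascending=False)   # highest mean first

print(chrono)
print(ranked)
```

Passing `ranked.index` as the `order` argument of a seaborn plot then reproduces the "sorted from highest mean" layout without recomputing the means.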
plt.figure(figsize = [15,8])
ax = sns.pointplot(data = melt_15, x = 'Month', y = 'ArrDelay', hue = 'Cause',
linestyles='dotted',dodge= 0.2,scale=1.15
,palette = 'rocket_r')
ax.set_title('Arrival delays (15 min. or more) means by delay-causes \n per Months\n for the year 2008',
fontsize=21,weight='bold')
ax.set_xlabel('Month',fontsize=18,weight='bold')
ax.set_ylabel('Arrival delay mean (min.)',fontsize=18,weight='bold')
plt.xticks(fontsize=15,weight='bold')
plt.yticks(fontsize=15,weight='bold')
legend = ax.legend(loc = 6, ncol = 1, framealpha = .8, title = 'Delay causes',# labels=causes,
fontsize=14, bbox_to_anchor = (1,0.5),prop={'size':18},columnspacing=.8)
plt.setp(legend.get_title(),fontsize='x-large')
sns.despine(left=True,bottom=True)
plt.grid(True);
print('\nFlights delays means per months and causes')
columns={'ArrDelay':'Arrival delay mean (min.)','Minutes':'Cause delay mean (min.)'}
#melt_15[list(set(melt_15.columns) -{'Cancelled'})].groupby(['Month','Cause']).mean().rename(columns=columns)
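# select the two numeric columns explicitly so that mean() does not fail on non-numeric columns
melt_15.groupby(['Month','Cause'])[['ArrDelay','Minutes']].mean().rename(columns=columns)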
Flights delays means per months and causes
| Month | Cause | Arrival delay mean (min.) | Cause delay mean (min.) |
|---|---|---|---|
| Jan | Carrier | 56.171442 | 36.079481 |
| | National Air System | 54.892898 | 28.103122 |
| | Security | 37.524057 | 16.790637 |
| | Weather | 80.988028 | 43.981173 |
| Feb | Carrier | 58.401192 | 36.286006 |
| | National Air System | 57.075527 | 27.780957 |
| | Security | 42.727987 | 17.863208 |
| | Weather | 82.425008 | 44.660814 |
| Mar | Carrier | 56.856152 | 34.910855 |
| | National Air System | 55.464507 | 26.868463 |
| | Security | 37.348328 | 17.435986 |
| | Weather | 84.570406 | 46.547068 |
| Apr | Carrier | 55.486343 | 35.108070 |
| | National Air System | 50.932327 | 25.086923 |
| | Security | 35.063025 | 16.060924 |
| | Weather | 83.361501 | 48.953925 |
| May | Carrier | 53.620574 | 33.535753 |
| | National Air System | 49.617749 | 26.008819 |
| | Security | 38.816701 | 19.859470 |
| | Weather | 76.167979 | 41.997025 |
| Jun | Carrier | 59.820534 | 35.793324 |
| | National Air System | 59.588023 | 30.752175 |
| | Security | 40.255499 | 17.959391 |
| | Weather | 86.900625 | 47.672530 |
| Jul | Carrier | 61.506064 | 39.506669 |
| | National Air System | 62.192084 | 33.147465 |
| | Security | 43.842610 | 25.193858 |
| | Weather | 89.995229 | 55.242421 |
| Aug | Carrier | 59.367418 | 38.772434 |
| | National Air System | 57.919304 | 30.155358 |
| | Security | 37.902116 | 19.613757 |
| | Weather | 82.097683 | 46.223175 |
| Sep | Carrier | 53.721025 | 39.542116 |
| | National Air System | 48.372589 | 27.562945 |
| | Security | 37.636816 | 23.995025 |
| | Weather | 71.277867 | 50.443945 |
| Oct | Carrier | 51.009869 | 34.557236 |
| | National Air System | 44.460146 | 24.703283 |
| | Security | 37.105634 | 18.609155 |
| | Weather | 76.976979 | 42.444265 |
| Nov | Carrier | 55.038367 | 33.525817 |
| | National Air System | 51.180908 | 28.693960 |
| | Security | 37.261411 | 16.958506 |
| | Weather | 76.694975 | 38.078099 |
| Dec | Carrier | 63.262444 | 33.489687 |
| | National Air System | 62.661970 | 27.564933 |
| | Security | 41.957162 | 16.306560 |
| | Weather | 90.865911 | 44.833603 |
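The table has two different means per row: the mean total arrival delay of flights affected by a cause, and the mean number of minutes attributed to that cause. The sketch below shows one way a long frame like `melt_15` could produce such a pair of columns. It is an illustration only: the wide column names (`CarrierDelay`, `WeatherDelay`), the `cause_names` mapping and the `Minutes > 0` filter are assumptions, not taken from the notebook.

```python
import pandas as pd

# Hypothetical wide frame: one row per delayed flight, one column per cause.
wide = pd.DataFrame({
    "Month": ["Jan", "Jan", "Feb"],
    "ArrDelay": [40.0, 90.0, 25.0],
    "CarrierDelay": [40.0, 0.0, 10.0],
    "WeatherDelay": [0.0, 90.0, 15.0],
})
cause_names = {"CarrierDelay": "Carrier", "WeatherDelay": "Weather"}

# Wide -> long: one row per (flight, cause) pair.
long = wide.melt(id_vars=["Month", "ArrDelay"],
                 value_vars=list(cause_names),
                 var_name="Cause", value_name="Minutes")
long["Cause"] = long["Cause"].map(cause_names)

# Assumption: keep only the causes that actually contributed minutes.
long = long[long["Minutes"] > 0]

means = long.groupby(["Month", "Cause"])[["ArrDelay", "Minutes"]].mean()
print(means)
```

Under this layout a flight delayed by several causes contributes its full `ArrDelay` to each of them, which would explain why the arrival-delay mean is always larger than the cause-delay mean in the table.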
plt.figure(figsize = [15,10])
#plt.suptitle('Arrival delays means by delay causes \n per Airlines\n for the year 2008',fontsize=21,weight='bold')
ax = sns.boxplot(data = melt_15, x = 'Month', y = 'ArrDelay', hue = 'Cause',
dodge= 0.2,#scale=1.15 ,linestyles='',
palette = 'rocket_r')
ax.set_title('Arrival delay (15 min. or more) by delay-causes boxplots\n per Months\n for the year 2008',
fontsize=21,weight='bold',y=1.18)
ax.set_xlabel('Month',color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',y=-.2)
ax.set_ylabel('Arrival delay (min.)', color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',x=-.2)
plt.xticks(fontsize=15,weight='bold',rotation=90)
plt.yticks(np.arange(0,3000,500),
[0] + [str(n)+'k' for n in np.arange(0.5,3,0.5)],
#[0] + [str(n/10)+'k' for n in range(5,30,5)],
fontsize=15,weight='bold')
#ax.xaxis.tick_top()
legend = ax.legend(title = 'Delay causes', ncol = 4, framealpha = .8,
loc='upper center',bbox_to_anchor=(.5,1.16),
# labels=causes,
fontsize=14,prop={'size':16},columnspacing=.8)
plt.setp(legend.get_title(),fontsize='x-large', color=sns.color_palette('viridis')[-3])
sns.despine(left=True,bottom=True)
plt.grid(True);
plt.figure(figsize = [15,10])
#plt.suptitle('Arrival delays means by delay causes \n per Airlines\n for the year 2008',fontsize=21,weight='bold')
ax = sns.boxplot(data = melt_15, x = 'Month', y = 'ArrDelay', hue = 'Cause',
order=month_arr_means.index,
dodge= 0.2,#scale=1.15 ,linestyles='',
palette = 'rocket_r')
title = 'Arrival delay (15 min. or more) by delay-causes boxplots\nper Months (from highest total arrival delay mean)'
ax.set_title(title+'\nfor the year 2008',fontsize=21,weight='bold',y=1.18)
ax.set_xlabel('Month',color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',y=-.2)
ax.set_ylabel('Arrival delay (min.)', color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',x=-.2)
plt.xticks(fontsize=15,weight='bold',rotation=90)
plt.yticks(np.arange(0,3000,500),
[0] + [str(n)+'k' for n in np.arange(0.5,3,0.5)],
#[0] + [str(n/10)+'k' for n in range(5,30,5)],
fontsize=15,weight='bold')
#ax.xaxis.tick_top()
legend = ax.legend(title = 'Delay causes', ncol = 4, framealpha = .8,
loc='upper center',bbox_to_anchor=(.5,1.16),
# labels=causes,
fontsize=14,prop={'size':16},columnspacing=.8)
plt.setp(legend.get_title(),fontsize='x-large', color=sns.color_palette('viridis')[-3])
sns.despine(left=True,bottom=True)
plt.grid(True);
plt.figure(figsize = [18,10])
#plt.suptitle('Arrival delays means by delay causes \n per Airlines\n for the year 2008',fontsize=21,weight='bold')
ax = sns.boxplot(data = melt_15, x = 'Month', y = 'ArrDelay', hue = 'Cause',
dodge= 0.2,#scale=1.15 ,linestyles='',
palette = 'rocket_r')
title = 'Arrival delay (15 min. or more) by delay-causes boxplots\n per Months\n for the year 2008'
ax.set_title(title + '\n (limiting to 200 min.)',
fontsize=21,weight='bold',y=1.18)
ax.set_xlabel('Month',color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',y=-.2)
ax.set_ylabel('Arrival delay (min.)', color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',x=-.2)
plt.ylim(0,200)
plt.xticks(fontsize=15,weight='bold')
plt.yticks(fontsize=15,weight='bold')
#ax.xaxis.tick_top()
legend = ax.legend(title = 'Delay causes', ncol = 4, framealpha = .8,
loc='upper center',bbox_to_anchor=(.5,1.16),
# labels=causes,
fontsize=14,prop={'size':16},columnspacing=.8)
plt.setp(legend.get_title(),fontsize='x-large', color=sns.color_palette('viridis')[-3])
sns.despine(left=True,bottom=True)
plt.grid(True);
plt.figure(figsize = [18,10])
#plt.suptitle('Arrival delays means by delay causes \n per Airlines\n for the year 2008',fontsize=21,weight='bold')
ax = sns.boxplot(data = melt_15, x = 'Month', y = 'ArrDelay', hue = 'Cause',
order=month_arr_means.index,
dodge= 0.2,#scale=1.15 ,linestyles='',
palette = 'rocket_r')
title = 'Arrival delay (15 min. or more) by delay-causes boxplots\n per Months (from highest total arrival delay mean)'
ax.set_title(title + '\n for the year 2008\n (limiting to 125 min.)',
fontsize=21,weight='bold',y=1.18)
ax.set_xlabel('Month',color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',y=-.2)
ax.set_ylabel('Arrival delay (min.)', color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',x=-.2)
plt.ylim(0,125)
plt.xticks(fontsize=15,weight='bold')
plt.yticks(fontsize=15,weight='bold')
#ax.xaxis.tick_top()
legend = ax.legend(title = 'Delay causes', ncol = 4, framealpha = .8,
loc='upper center',bbox_to_anchor=(.5,1.16),
# labels=causes,
fontsize=14,prop={'size':16},columnspacing=.8)
plt.setp(legend.get_title(),fontsize='x-large', color=sns.color_palette('viridis')[-3])
sns.despine(left=True,bottom=True)
plt.grid(True);
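The y-limits of 200 and 125 minutes in the two zoomed box plots are hand-picked. As a rough alternative, the upper whisker of each cause can be computed and the largest one used as the limit. This is a sketch under the assumption that the default box plot whisker rule (1.5 times the interquartile range) applies; `upper_whisker` and `toy` are illustrative names.

```python
import pandas as pd

def upper_whisker(values: pd.Series) -> float:
    """Largest observation within Q3 + 1.5 * IQR (default box plot whisker)."""
    q1, q3 = values.quantile([0.25, 0.75])
    limit = q3 + 1.5 * (q3 - q1)
    return values[values <= limit].max()

# Toy stand-in for melt_15 (hypothetical values).
toy = pd.DataFrame({
    "Cause": ["Carrier"] * 6,
    "ArrDelay": [15.0, 20.0, 25.0, 30.0, 35.0, 1000.0],
})

whiskers = toy.groupby("Cause")["ArrDelay"].apply(upper_whisker)
y_max = whiskers.max()
print(whiskers)
```

With the real data, `plt.ylim(0, y_max)` would then show every box and whisker in full while cutting off only the outliers.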
# # bivariate plots of Arrival-delays (15+ min., limited to 150) boxplots per delay causes by Months for year 2008 flights
# 1.5+ million flights
sns.set(font_scale=1.19,style='whitegrid')
order = melt_15['Cause'].unique()
# g = sns.FacetGrid(data=melt_15,col='Month',
# col_wrap=4,
# hue='Cause',hue_order= order,palette='pastel'
# )#,xlim=(0,300))
g = sns.catplot(kind='box',
data=melt_15,col='Month',col_order=month_arr_means.index,
col_wrap=4,
x='Cause',order=order,
y='ArrDelay',
hue='Cause',hue_order= order,palette='pastel',#'rocket_r',
height=3.5,facet_kws=dict(ylim=(0,150))
)
title = 'Arrival-delays (15+ min., limited to 150) boxplots by delay causes\n'
g.fig.suptitle(title + 'per Months (from highest total arrival delay mean)\n' + 'for the year 2008',
color='#BF1B00',
fontsize=21,
x=0.331,y=1.17)
# g.map(sns.barplot,'Cause','Minutes',order=order,)
g.set_titles('{col_name}',fontsize=18,weight='bold')
g.add_legend(title = 'Delay causes', #labels=['Carrier', 'Weather', 'National Air System', 'Security'])
loc='upper center',
ncol=4,
fontsize=14,
bbox_to_anchor=(0.331,1.073),
frameon=True,framealpha=.8)
plt.setp(g._legend.get_title(),fontsize=16)
g.set_xlabels('Cause',weight='bold',fontsize=14)#
# label the ticks from the same `order` list that positions the boxes, so labels cannot get out of sync
g.set_xticklabels(list(order),
rotation=90,fontsize=13)#,weight='bold'
g.set_ylabels('Arrival delay (min.)',weight='bold',fontsize=14);
ytickslabels = (g.axes[0].get_yticks()).astype('int')
g.set_yticklabels(ytickslabels,fontsize=13);#fontsize=12,color='blue',weight='bold';
plt.figure(figsize = [15,10])
#plt.suptitle('Arrival delays means by delay causes \n per Airlines\n for the year 2008',fontsize=21,weight='bold')
ax = sns.violinplot(data = melt_15, x = 'Month', y = 'ArrDelay', hue = 'Cause',
#order=month_arr_means.index,
dodge= 0.2,#scale=1.15 ,linestyles='',
)#palette = 'rocket_r')
title = 'Arrival delay (15 min. or more) by delay-causes violinplots\nper Months'# (from highest total arrival delay mean)'
ax.set_title(title+'\nfor the year 2008',fontsize=21,weight='bold',y=1.18)
ax.set_xlabel('Month',color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',y=-.2)
ax.set_ylabel('Arrival delay (min.)', color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',x=-.2)
plt.xticks(fontsize=15,weight='bold',rotation=90)
plt.yticks(np.arange(0,3000,500),
[0] + [str(n)+'k' for n in np.arange(0.5,3,0.5)],
#[0] + [str(n/10)+'k' for n in range(5,30,5)],
fontsize=15,weight='bold')
#ax.xaxis.tick_top()
legend = ax.legend(title = 'Delay causes', ncol = 4, framealpha = .8,
loc='upper center',bbox_to_anchor=(.5,1.16),
# labels=causes,
fontsize=14,prop={'size':16},columnspacing=.8)
plt.setp(legend.get_title(),fontsize='x-large', color=sns.color_palette('viridis')[-3])
sns.despine(left=True,bottom=True)
plt.grid(True);
plt.figure(figsize = [15,10])
#plt.suptitle('Arrival delays means by delay causes \n per Airlines\n for the year 2008',fontsize=21,weight='bold')
ax = sns.violinplot(data = melt_15, x = 'Month', y = 'ArrDelay', hue = 'Cause',
order=month_arr_means.index,
dodge= 0.2,#scale=1.15 ,linestyles='',
)#palette = 'rocket_r')
title = 'Arrival delay (15 min. or more) by delay-causes violinplots\nper Months (from highest total arrival delay mean)'
ax.set_title(title+'\nfor the year 2008',fontsize=21,weight='bold',y=1.18)
ax.set_xlabel('Month',color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',y=-.2)
ax.set_ylabel('Arrival delay (min.)', color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',x=-.2)
plt.xticks(fontsize=15,weight='bold',rotation=90)
plt.yticks(np.arange(0,3000,500),
[0] + [str(n)+'k' for n in np.arange(0.5,3,0.5)],
#[0] + [str(n/10)+'k' for n in range(5,30,5)],
fontsize=15,weight='bold')
#ax.xaxis.tick_top()
legend = ax.legend(title = 'Delay causes', ncol = 4, framealpha = .8,
loc='upper center',bbox_to_anchor=(.5,1.16),
# labels=causes,
fontsize=14,prop={'size':16},columnspacing=.8)
plt.setp(legend.get_title(),fontsize='x-large', color=sns.color_palette('viridis')[-3])
sns.despine(left=True,bottom=True)
plt.grid(True);
plt.figure(figsize = [18,10])
#plt.suptitle('Arrival delays means by delay causes \n per Airlines\n for the year 2008',fontsize=21,weight='bold')
ax = sns.violinplot(data = melt_15, x = 'Month', y = 'ArrDelay', hue = 'Cause',
#order=month_arr_means.index,
dodge= 0.2,#scale=1.15 ,linestyles='',
)#palette = 'rocket_r')
title = 'Arrival delay (15 min. or more) by delay-causes violinplots\n per Months'# (from highest total arrival delay mean)'
ax.set_title(title +'\n for the year 2008' + '\n (limiting to 150 min.)',
fontsize=21,weight='bold',y=1.18)
ax.set_xlabel('Month',color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',y=-.2)
ax.set_ylabel('Arrival delay (min.)', color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',x=-.2)
plt.ylim(-10,150)
plt.xticks(fontsize=15,weight='bold')
plt.yticks(fontsize=15,weight='bold')
#ax.xaxis.tick_top()
legend = ax.legend(title = 'Delay causes', ncol = 4, framealpha = .8,
loc='upper center',bbox_to_anchor=(.5,1.16),
# labels=causes,
fontsize=14,prop={'size':16},columnspacing=.8)
plt.setp(legend.get_title(),fontsize='x-large', color=sns.color_palette('viridis')[-3])
sns.despine(left=True,bottom=True)
plt.grid(True);
plt.figure(figsize = [18,10])
#plt.suptitle('Arrival delays means by delay causes \n per Airlines\n for the year 2008',fontsize=21,weight='bold')
ax = sns.violinplot(data = melt_15, x = 'Month', y = 'ArrDelay', hue = 'Cause',
order=month_arr_means.index,
dodge= 0.2,#scale=1.15 ,linestyles='',
)#palette = 'rocket_r')
title = 'Arrival delay (15 min. or more) by delay-causes violinplots\n per Months (from highest total arrival delay mean)'
ax.set_title(title +'\n for the year 2008' + '\n (limiting to 150 min.)',
fontsize=21,weight='bold',y=1.18)
ax.set_xlabel('Month',color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',y=-.2)
ax.set_ylabel('Arrival delay (min.)', color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',x=-.2)
plt.ylim(-10,150)
plt.xticks(fontsize=15,weight='bold')
plt.yticks(fontsize=15,weight='bold')
#ax.xaxis.tick_top()
legend = ax.legend(title = 'Delay causes', ncol = 4, framealpha = .8,
loc='upper center',bbox_to_anchor=(.5,1.16),
# labels=causes,
fontsize=14,prop={'size':16},columnspacing=.8)
plt.setp(legend.get_title(),fontsize='x-large', color=sns.color_palette('viridis')[-3])
sns.despine(left=True,bottom=True)
plt.grid(True);
# # bivariate plots of Arrival-delays (15+ min., limited to 150) violinplots per delay causes by Months for year 2008 flights
# 1.5+ million flights
sns.set(font_scale=1.19,style='whitegrid')
order = melt_15['Cause'].unique()
# g = sns.FacetGrid(data=melt_15,col='Month',
# col_wrap=4,
# hue='Cause',hue_order= order,palette='pastel'
# )#,xlim=(0,300))
g = sns.catplot(kind='violin',
data=melt_15,col='Month',col_order=month_arr_means.index,
col_wrap=4,
x='Cause',order=order,
y='ArrDelay',
hue='Cause',hue_order= order,palette='pastel',#'rocket_r',
height=3.5,facet_kws=dict(ylim=(0,150))
)
title = 'Arrival-delays (15+ min., limited to 150) violinplots by delay causes\n'
g.fig.suptitle(title + 'per Months (from highest total arrival delay mean)\n' + 'for the year 2008',
color='#BF1B00',
fontsize=21,
x=0.331,y=1.17)
# g.map(sns.barplot,'Cause','Minutes',order=order,)
g.set_titles('{col_name}',fontsize=18,weight='bold')
g.add_legend(title = 'Delay causes', #labels=['Carrier', 'Weather', 'National Air System', 'Security'])
loc='upper center',
ncol=4,
fontsize=14,
bbox_to_anchor=(0.331,1.073),
frameon=True,framealpha=.8)
plt.setp(g._legend.get_title(),fontsize=16)
g.set_xlabels('Cause',weight='bold',fontsize=14)#
# label the ticks from the same `order` list that positions the violins, so labels cannot get out of sync
g.set_xticklabels(list(order),
rotation=90,fontsize=13)#,weight='bold'
g.set_ylabels('Arrival delay (min.)',weight='bold',fontsize=14);
ytickslabels = (g.axes[0].get_yticks()).astype('int')
g.set_yticklabels(ytickslabels,fontsize=13);#fontsize=12,color='blue',weight='bold';
# month_counts.to_frame('Month').rename(columns={'Month':'Flights counts'}).sort_values('Flights counts',ascending=False)
# calculating flights proportions and counts per airline
# creating 'prop' dataframe and sorting by total flights
# prop_aireline = df_inv['Carrier'].value_counts().to_frame('Flights').sort_values('Flights',ascending=True)
# prop_aireline['On-time'] = df_inv[(df_inv.ArrDelay < 15) & (~df_inv.ArrDelay.isna())]['Carrier'].value_counts()
# prop_aireline['On-time prop (per airline flights)'] = round(100 * prop_aireline['On-time'] / prop_aireline['Flights'],2)
# prop_aireline['On-time prop (per airline flights)'] = prop_aireline['On-time prop (per airline flights)'].astype('str') + ' %'
# prop_aireline['Delayed'] = df_inv_15['Carrier'].value_counts()
# prop_aireline['Del. prop (per airline flights)'] = round(100 * prop_aireline['Delayed'] / prop_aireline['Flights'],2)
# prop_aireline['Del. prop (per airline flights)'] = prop_aireline['Del. prop (per airline flights)'].astype('str') + ' %'
# prop_aireline['Cancelled'] = airline_cancelled['Carrier'].value_counts()
# prop_aireline['Canc. prop (per airline flights)'] = round(100 * prop_aireline['Cancelled'] / prop_aireline['Flights'],2)
# prop_aireline['Canc. prop (per airline flights)'] = prop_aireline['Canc. prop (per airline flights)'].astype('str') + ' %'
# prop_aireline['Diverted'] = prop_aireline['Flights'] - (prop_aireline['On-time']+prop_aireline['Delayed']+prop_aireline['Cancelled'])
# prop_aireline['Div. prop (per airline flights)'] = round(100 * prop_aireline['Diverted'] / prop_aireline['Flights'],2)
# prop_aireline['Div. prop (per airline flights)'] = prop_aireline['Div. prop (per airline flights)'].astype('str') + ' %'
# prop_aireline
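The commented-out cell above assembles the per-airline table one column at a time. A more compact way to get the same kind of table is to classify every flight once and cross-tabulate. This is a sketch on toy data, not the notebook's method: the `Cancelled` flag column and the rule "not cancelled but no arrival delay recorded means diverted" are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_inv (hypothetical values).
toy = pd.DataFrame({
    "Carrier": ["AA", "AA", "AA", "AA", "WN", "WN"],
    "ArrDelay": [5.0, 30.0, np.nan, np.nan, 0.0, 16.0],
    "Cancelled": [0, 0, 1, 0, 0, 0],
})

# Conditions are checked in order; the first match wins.
status = np.select(
    [toy["Cancelled"].eq(1), toy["ArrDelay"].isna(), toy["ArrDelay"].ge(15)],
    ["Cancelled", "Diverted", "Delayed"],
    default="On-time",
)

counts = pd.crosstab(toy["Carrier"], status)
counts = counts.reindex(columns=["On-time", "Delayed", "Cancelled", "Diverted"],
                        fill_value=0)
counts.insert(0, "Flights", counts.sum(axis=1))

# Percentages of each airline's own flights.
share = counts.div(counts["Flights"], axis=0).mul(100).round(2)
print(counts)
print(share)
```

Keeping the percentages numeric (rather than converting them to strings with a `%` suffix) leaves them usable for sorting and plotting; the suffix can be added at display time.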
#### plotting clustered bar chart for flights counts per airline
colors = [sns.color_palette()[0] , sns.color_palette()[2] , sns.color_palette()[1] , sns.color_palette()[3]]
ax = prop_aireline.iloc[:,[5,3,1,0]]\
.sort_values('Flights',ascending=True)\
.plot(kind='barh',figsize=(15,18),rot=0,color=colors[::-1])
#color=list(sns.color_palette()[0:4][::-1] #['r', 'g' ,'orange','b']
sns.set(font_scale=1.19,style='ticks')
sns.despine(bottom=True)
plt.title('Flights frequencies by airlines \nfor the year 2008\n(sorted from highest total flights)',
fontsize=24,color='#BF1B00',weight='bold',y=1.1)
plt.ylabel('Airline',fontsize=18,weight='bold')
plt.xlabel('')#'Freq.',fontsize=17,weight='bold')
plt.yticks(fontsize=14,weight='bold')
plt.xticks(np.arange(0,1.4,.2)*10**6,
[0] + [str(n/10)+'e6' for n in range(2,14,2)],
#[0,'0.2e6','0.4e6','0.6e6','0.8e6','1.0e6','1.2e6'],
fontsize=14,weight='bold')
ax.xaxis.tick_top()
# ------ adjusting legend
# handles, labels = ax.get_legend_handles_labels()
labels=['All','On-time','Delayed','Cancelled']
# ax.legend(handles[::-1], labels,
# ncol=4,
# loc='upper center',bbox_to_anchor=(.5,1.09),
# fontsize=13,framealpha =.9,#,bbox_to_anchor=(.5,1.047),
# prop={'size':20})
plt.legend(ax.get_legend_handles_labels()[0][::-1], labels, #reversed(labels),
ncol=4,
loc='upper center',bbox_to_anchor=(.5,1.09),
fontsize=13,framealpha =.9,#,bbox_to_anchor=(.5,1.047),
prop={'size':20},
)# colors=list(sns.color_palette()[0:4][::-1]))
plt.grid(True,axis='x')
# drawing max and min lines
max_min_lines = {'on-time':(967154,7358),
'delayed':(219757,394),
'cancelled':(18331,42)}
colors = (color for color in colors[1:])
for m in max_min_lines:
    max_count, min_count = max_min_lines[m]  # avoid shadowing the built-ins max() and min()
    color = next(colors)
    plt.axvline(max_count,color=color,linewidth=1.7, linestyle ='-');
    plt.axvline(min_count,color=color,linewidth=1.7, linestyle ='-');
#'-', '--', '-.', ':', 'None', ' ', '', 'solid', 'dashed', 'dashdot', 'dotted'
print('\nHint: Colored lines indicate the Max and Min frequencies')
Hint: Colored lines indicate the Max and Min frequencies
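The maximum and minimum values used for the reference lines are typed in by hand. They could instead be read from the counts table, so the lines stay correct if the data changes. A minimal sketch, assuming a table with one count column per flight status; `toy_counts` and its values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the count columns of prop_aireline (hypothetical values).
toy_counts = pd.DataFrame(
    {"On-time": [900, 70, 300], "Delayed": [200, 5, 80], "Cancelled": [20, 1, 9]},
    index=["WN", "AQ", "AA"],
)

# (max, min) per status, in the same shape as the hand-written dictionary.
max_min_lines = {
    col.lower(): (int(toy_counts[col].max()), int(toy_counts[col].min()))
    for col in toy_counts.columns
}
print(max_min_lines)
```

The resulting dictionary can be passed to the same `plt.axvline` loop unchanged.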
colors = [sns.color_palette()[2],sns.color_palette()[1],sns.color_palette()[3],sns.color_palette()[-5]]
ax = prop_aireline.sort_values('Flights',ascending=True) \
.iloc[:,[1,3,5,7]] \
.plot(kind='barh',figsize=(15,12),rot=0,stacked=True,color=colors)
sns.set(font_scale=1.19,style='ticks')
sns.despine(left=True,bottom=True)
plt.title('Flights frequencies (stacked) by airlines \nfor the year 2008\n(sorted from highest total flights)',
fontsize=25,color='#BF1B00',weight='bold',y=1.14)
plt.ylabel('Airline',fontsize=18,weight='bold')
plt.yticks(fontsize=14,weight='bold')#,rotation=90)
plt.xticks(np.arange(0,1.5,.2)*10**6,
[0] + [str(n/10)+'e6' for n in range(2,15,2)],
#[0,'0.2e6','0.4e6','0.6e6','0.8e6','1e6','1.2e6','1.4e6'],
fontsize=14,weight='bold',)
plt.xlim(0,1.3*10**6)
plt.legend(loc='upper center',#bbox_to_anchor=(.5,1.07),
fontsize=13,framealpha =.9,
ncol=4,#labelcolor ='red'),
#loc='center right',bbox_to_anchor=(1.2,.4),#,bbox_to_anchor=(.5,1.047),
bbox_to_anchor=(.5,1.13),
#,bbox_to_anchor=(.5,1.047),
prop={'size':20})
plt.grid(True,axis='x')
ax.xaxis.tick_top()
# drawing max and min lines
max_min_lines = {'on-time':(967154,7358),
'delayed':((219757 + 967154) , (394+7358)) ,
'cancelled':((18331 + 113672 + 357489) , (42 + 394 + 7358))}
color_gen = (color for color in colors)
for flights in max_min_lines:
    max_count, min_count = max_min_lines[flights]  # avoid shadowing the built-ins max() and min()
    color = next(color_gen)
    plt.axvline(max_count,color=color,ymin=.35,
                linewidth=1.7,
                linestyle ='-')
    plt.text(max_count+80000,14.6,'Max value line',color=color,ha='center',weight='bold')
    plt.axvline(min_count,color=color,ymax=.15,#xmin=.68,xmax=.91
                #linewidth=1.2,
                linestyle ='-')
    plt.text(min_count+70000,1 * 10**-7,'Min value line',color='black',ha='center')
print('\nHint: Colored lines indicate the Max and Min frequencies')
Hint: Colored lines indicate the Max and Min frequencies
# Carriers against flights counts
# for the whole data points df_inv (7,009,724 rows), against on_time (5,330,292 rows),
# df_inv_15 (1,524,733 rows) , airline_cancelled (137,434 rows)
fig , ax = plt.subplots(2,2,sharex=True,figsize=(15,29.3))
#ax = ax.flatten()
plt.suptitle('Carriers flights counts (from highest) and proportions \nfor the year 2008',
fontsize=22,color='#BF1B00',
#x=.45,
y=.94)
# first subplot
# Total flights per Carriers
ax[0,0] = plt.subplot(2,2,1)
carrier_counts = df_inv['Carrier'].value_counts()
sns.countplot(data = df_inv, y = 'Carrier',
order = carrier_counts.index,
color = sns.color_palette('rocket')[1],
#sns.color_palette('vlag')[-1],#sns.color_palette('rocket')[0],#default_color,
ax=ax[0,0])
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
plt.title('Totals \n[7,009,724 flights]',fontsize=21,color='#AD1FFF',y=1.023)#,x=.45,y=1.0365)
plt.xlabel('')
plt.ylabel('Carrier',fontsize=18)#weight='bold',
plt.xticks([])
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct count
    count = carrier_counts[label.get_text()]
    percentage_total = 100*count/7009724
    string = str(f"{count:,}") + "\n{:0.2f}%".format(percentage_total) #+ ' flights'
    # print the annotation next to the bar
    plt.text(count+3000, loc, s= string, va = 'center', color = 'black');
#########
# second subplot
# on-time flights by Carriers
ax[0,1] = plt.subplot(2,2,2)
carrier_counts = on_time['Carrier'].value_counts()
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
sns.countplot(data = on_time, y = 'Carrier',
order = carrier_counts.index,
color = sns.color_palette('vlag')[0],#sns.color_palette('viridis')[-3],# 'skyblue',
ax=ax[0,1])
plt.title('On-time \n[5,330,292 flights]',fontsize=21,color='#AD1FFF',y=1.023)#,x=.25
plt.xlabel('')
plt.ylabel('')
#plt.ylabel('Month',weight='bold',fontsize=18)
plt.xticks([])
plt.yticks(color = default_color)
# Recalculating the total flights counts for carriers
carrier_flights_count_whole = df_inv['Carrier'].value_counts() # to be used in calculating percentages of carrier flights
# on-time counts
# against carrier total flights number
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct count
    count = carrier_counts[label.get_text()]
    count_whole = carrier_flights_count_whole[label.get_text()]
    percentage_value = 100*count/count_whole
    percentage_total = 100*count/5330292
    string = str(f"{count:,}") + " ({:0.2f}% of carrier's)\n{:0.2f}%".format(percentage_value,percentage_total)# flights
    # print the annotation next to the bar
    plt.text(count+3000, loc, s=string, va = 'center', color = 'black');
##################################
# third subplot
# delayed flights by Carriers
ax[1,0] = plt.subplot(2,2,3)
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
carrier_counts = df_inv_15['Carrier'].value_counts()
sns.countplot(data = df_inv_15, y = 'Carrier',
order = carrier_counts.index,
color = 'skyblue',
ax=ax[1,0])
plt.title('Delayed\n(15 min. arrival delay or more)\n[1,524,733 flights]',fontsize=21,color='#AD1FFF',y=1.023)
#,x=.45,y=1.0365)#,x=.25
plt.xlabel('')
plt.ylabel('')
plt.ylabel('Carrier',fontsize=18)#weight='bold',
plt.xticks([])
plt.yticks(color = 'brown')
# Recalculating the total flights counts for carriers
carrier_flights_count_whole = df_inv['Carrier'].value_counts() # to be used in calculating percentages of carrier flights
# counts with delays of 15 minutes or more
# against carrier total flights number
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct count
    count = carrier_counts[label.get_text()]
    count_whole = carrier_flights_count_whole[label.get_text()]
    percentage_value = 100*count/count_whole
    percentage_total = 100*count/1524733
    string = str(f"{count:,}") + " ({:0.2f}% of carrier's)\n{:0.2f}%".format(percentage_value,percentage_total)# flights
    # print the annotation next to the bar
    plt.text(count+3000, loc, s=string, va = 'center', color = 'black');
#################
# fourth subplot
# cancelled flights by Carriers
ax[1,1] = plt.subplot(2,2,4)
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
carrier_counts = airline_cancelled['Carrier'].value_counts()
sns.countplot(data = airline_cancelled ,
y = 'Carrier',
order = carrier_counts.index,
color = sns.color_palette()[1],
ax=ax[1,1])
plt.title('Cancelled \n[137,434 flights]',fontsize=21,color='#AD1FFF',y=1.0365)#x=.205,y=1.0365)
plt.xlabel('')
plt.ylabel('')
#plt.ylabel('Carrier',weight='bold',fontsize=18)
plt.xticks([])
plt.yticks(color = sns.color_palette()[1],fontsize=12.5)
# Recalculating the total flights counts for carriers
carrier_flights_count_whole = df_inv['Carrier'].value_counts() # to be used in calculating percentages of carrier flights
# cancelled-flight counts
# against carrier total flights number
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct count
    count = carrier_counts[label.get_text()]
    count_whole = carrier_flights_count_whole[label.get_text()]
    percentage_value = 100*count/count_whole
    percentage_total = 100*count/137434
    string = str(f"{count:,}") + " ({:0.2f}% of carrier's)\n{:0.2f}%".format(percentage_value,percentage_total)# flights
    # print the annotation next to the bar
    plt.text(count+3000, loc, s=string, va = 'center', color = 'black');#,fontsize=12.5);
#########
plt.subplots_adjust(wspace=0.638,hspace=.12);
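The four subplots above repeat nearly the same annotation loop. The label-building part could be pulled into one helper, as in this sketch. `bar_labels` is a hypothetical helper, not part of the notebook, and it derives the subset total from the counts themselves instead of the hard-coded totals used above.

```python
import pandas as pd

def bar_labels(subset_counts: pd.Series, total_counts: pd.Series) -> dict:
    """Build one annotation string per carrier.

    subset_counts: flights per carrier in a subset (e.g. delayed flights)
    total_counts:  all flights per carrier
    """
    subset_total = subset_counts.sum()
    labels = {}
    for carrier, count in subset_counts.items():
        pct_of_carrier = 100 * count / total_counts[carrier]
        pct_of_subset = 100 * count / subset_total
        labels[carrier] = (f"{count:,} ({pct_of_carrier:0.2f}% of carrier's)"
                           f"\n{pct_of_subset:0.2f}%")
    return labels

# Toy counts (hypothetical values).
total = pd.Series({"WN": 1000, "AA": 500})
delayed = pd.Series({"WN": 200, "AA": 150})

labels = bar_labels(delayed, total)
print(labels["WN"])
```

Inside each subplot, the loop over `plt.yticks()` would then only need to look up `labels[label.get_text()]` and call `plt.text`.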
# Carrier against flights counts
# for the whole data points df_inv (7,009,724 rows), against df_inv_15 (1,524,733 rows) , airline_cancelled (137,434 rows)
default_color = sns.color_palette()[0]
fig, ax = plt.subplots(1, 3, sharex=True,figsize=(15.5,21.8))
plt.suptitle('Airlines flights counts (from highest) and proportions \nfor the year 2008',
fontsize=22,color='#BF1B00',
x=.45,y=.99)
# first subplot
ax[0] = plt.subplot(1,3,1)
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
sns.countplot(data = airline_cancelled ,
y = 'Carrier',
order = airline_cancelled['Carrier'].value_counts().index,
color = sns.color_palette()[1],
ax=ax[0])
plt.title('Cancelled \n[137,434 flights]',fontsize=21,color='#AD1FFF',x=.205,y=1.0365)
plt.xlabel('')
plt.ylabel('Carrier',weight='bold',fontsize=18)
plt.xticks([])
plt.yticks(color = sns.color_palette()[1],fontsize=12.5)
# Recalculating the flights_counts for carriers
carrier_flights_count = airline_cancelled['Carrier'].value_counts()
carrier_flights_count_whole = df_inv['Carrier'].value_counts() # to be used in calculating percentages of carrier flights
# counts with delays of 15 minutes or more
# against carrier total flights number
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct count
    count = carrier_flights_count[label.get_text()]
    count_whole = carrier_flights_count_whole[label.get_text()]
    percentage_value = 100*count/count_whole
    percentage_total = 100*count/137434
    string = str(f"{count:,}") + " ({:0.1f}% of carrier's)\n{:0.2f}%".format(percentage_value,percentage_total)# flights
    # print the annotation next to the bar
    plt.text(count+3000, loc, s=string, va = 'center', color = 'black',fontsize=14);
##################################
# second subplot
ax[1] = plt.subplot(1,3,2)
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
sns.countplot(data = df_inv_15, y = 'Carrier',
order = df_inv_15['Carrier'].value_counts().index,
color = 'skyblue',
ax=ax[1])
plt.title('Delayed\n(15 min. arrival delay or more)\n[1,524,733 flights]',fontsize=21,color='#AD1FFF',x=.25,y=1.023)
plt.xlabel('')
plt.ylabel('')
#plt.ylabel('Carrier',weight='bold',fontsize=18)
plt.xticks([])
plt.yticks(color = 'brown')
# Recalculating the flights_counts for carriers
carrier_flights_count = df_inv_15['Carrier'].value_counts()
carrier_flights_count_whole = df_inv['Carrier'].value_counts() # to be used in calculating percentages of carrier flights
# counts with delays of 15 minutes or more
# against carrier total flights number
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct count
    count = carrier_flights_count[label.get_text()]
    count_whole = carrier_flights_count_whole[label.get_text()]
    percentage_value = 100*count/count_whole
    percentage_total = 100*count/1524733
    string = str(f"{count:,}") + " ({:0.1f}% of carrier's)\n{:0.2f}%".format(percentage_value,percentage_total)# flights
    # print the annotation next to the bar
    plt.text(count+3000, loc, s=string, va = 'center', color = 'black',fontsize=14);
##################################
# third subplot
ax[2] = plt.subplot(1,3,3)
sns.countplot(data = df_inv, y = 'Carrier',
order = df_inv['Carrier'].value_counts().index,
color = default_color,
ax=ax[2])
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
plt.title(' Totals \n[7,009,724 flights]',fontsize=21,color='#AD1FFF',x=.45,y=1.0365)
plt.xlabel('')
plt.ylabel('')
plt.xticks([])
# Recalculating the flights_counts for carriers
carrier_flights_count = df_inv['Carrier'].value_counts()
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct count
    count = carrier_flights_count[label.get_text()]
    percentage_total = 100*count/7009724
    string = str(f"{count:,}") + "\n{:0.2f}%".format(percentage_total) #+ ' flights'
    # print the annotation next to the bar
    plt.text(count+3000, loc, s= string, va = 'center', color = 'black');
#########
plt.subplots_adjust(wspace=1.33);
#if percentage_value >;
#plt.yticks([])
# Carrier against flights counts
# for the whole data points df_inv (7,009,724 rows), against df_inv_15 (1,524,733 rows)
default_color = sns.color_palette()[0]
fig, ax = plt.subplots(1, 2, sharex=True,figsize=(15,15))
plt.suptitle('Airlines flights counts (from highest) and proportions\n for the year 2008'
,fontsize=25,color='#BF1B00',x=.45,y=1.012)
# first subplot
ax[0] = plt.subplot(1,2,1)
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
sns.countplot(data = df_inv_15, y = 'Carrier',
order = df_inv_15['Carrier'].value_counts().index,
color = 'skyblue',
ax=ax[0])
plt.title('Delayed \n(15 min. arrival delay or more) \n[1,524,733 flights]',
fontsize=21,color='#AD1FFF',x=.25,y=1.015)
plt.xlabel('')
plt.ylabel('Carrier',weight='bold',fontsize=18)
plt.xticks([])
plt.yticks(color = 'brown')
# Recalculating the flights_counts for carriers
carrier_flights_count = df_inv_15['Carrier'].value_counts()
carrier_flights_count_whole = df_inv['Carrier'].value_counts() # to be used in calculating percentages of carrier flights
# counts with delays of 15 minutes or more
# against carrier total flights number
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct count
    count = carrier_flights_count[label.get_text()]
    count_whole = carrier_flights_count_whole[label.get_text()]
    percentage_value = 100*count/count_whole
    percentage_total = 100*count/1524733
    string = str(f"{count:,}") + " ({:0.2f}% of carrier's)\n{:0.2f}%".format(percentage_value,percentage_total)
    # print the annotation next to the bar
    plt.text(count+3000, loc, s=string, va = 'center', color = 'black');
##################################
# second subplot
ax[1] = plt.subplot(1,2,2)
sns.countplot(data = df_inv, y = 'Carrier',
order = df_inv['Carrier'].value_counts().index,
color = default_color,
ax=ax[1])
sns.set(font_scale=1.19,style='white')
sns.despine(bottom=True)
plt.title(' Totals \n[7,009,724 flights]',fontsize=21,color='#AD1FFF',x=.45,y=1.02)
plt.xlabel('')
plt.ylabel('')
plt.xticks([])
# Recalculating the flights_counts for carriers
carrier_flights_count = df_inv['Carrier'].value_counts()
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct count
    count = carrier_flights_count[label.get_text()]
    percentage_total = 100*count/7009724
    string = str(f"{count:,}") + "\n{:0.2f}%".format(percentage_total)
    # print the annotation next to the bar
    plt.text(count+3000, loc, s= string, va = 'center', color = 'black');
#########
plt.subplots_adjust(wspace=0.3);
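The annotation loops in the cells above all build the same kind of label: a count, its percentage of the carrier's own flights, and its percentage of the group total. A minimal sketch of a shared helper follows; the name `format_bar_label` and its parameters are hypothetical and not part of the original notebook. The sample numbers are the Southwest Airlines Co. figures reported in this notebook.

```python
def format_bar_label(count, carrier_total=None, group_total=None, decimals=2):
    """Build the text drawn next to a bar (hypothetical helper).

    count         -- number of flights shown by the bar
    carrier_total -- all flights of that carrier (optional)
    group_total   -- all flights in the plotted group (optional)
    """
    label = f"{count:,}"
    if carrier_total:
        label += f" ({100 * count / carrier_total:.{decimals}f}% of carrier's)"
    if group_total:
        label += f"\n{100 * count / group_total:.{decimals}f}%"
    return label


# Southwest Airlines Co.: delayed flights, carrier total, all delayed flights
southwest_label = format_bar_label(219_757, 1_201_754, 1_524_733)
print(southwest_label)
```

Inside each loop, the call `plt.text(count+3000, loc, s=format_bar_label(...), va='center')` would then replace the repeated string concatenation.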
# All flights proportions per Airlines
#plt.figure(figsize=(15,15))
#plt.pie()
#plt.plot()
carrier_counts = df_inv['Carrier'].value_counts()
first_10 = list(carrier_counts.head(10).index)
empty_list = ["" for i in range(len(carrier_counts)-10)]
n_list = first_10 + empty_list
carrier_counts.plot(kind='pie',figsize=(14,12),radius=2.5, wedgeprops = {'width' : .4}, pctdistance=1.086, labeldistance=1.15,
label='',labels=n_list,
counterclock=False,startangle=90,
#shadow=True,
explode=[0.08 for i in range(len(carrier_counts))],
autopct="%1.1f%%", colors=sns.color_palette(),fontsize=18.7)#,weight='bold')
plt.title("Total Flights proportions per Airlines \nfor the year 2008\n(7,009,724 flights)",
color='#BF1B00', fontsize=30,y=1.04)
plt.axis("equal")
plt.legend(labels=carrier_counts.index,
loc=6,bbox_to_anchor=(-.659,.5),
fontsize=16,prop={'size':20.15});
# on-time flights proportions per Airlines
#plt.figure(figsize=(15,15))
#plt.pie()
#plt.plot()
carrier_counts = on_time['Carrier'].value_counts()
first_10 = list(carrier_counts.head(10).index)
empty_list = ["" for i in range(len(carrier_counts)-10)]
n_list = first_10 + empty_list
carrier_counts.plot(kind='pie',figsize=(14,12),radius=2.5, wedgeprops = {'width' : .4}, pctdistance=1.082, labeldistance=1.15,
label='',labels=n_list,
counterclock=False,startangle=90,
explode=[0.08 for i in range(len(carrier_counts))],
autopct="%1.1f%%", colors=sns.color_palette(),fontsize=18.5)#,weight='bold')
plt.title("On-time Flights proportions per Airlines \nfor the year 2008\n(5,330,292 flights)",
color='#BF1B00', fontsize=30,y=1.04)
plt.axis("equal")
plt.legend(labels=carrier_counts.index,
loc=6,bbox_to_anchor=(-.659,.5),
fontsize=16,prop={'size':20.15});
# delayed flights proportions per Airlines
#plt.figure(figsize=(15,15))
#plt.pie()
#plt.plot()
carrier_counts = df_inv_15['Carrier'].value_counts()
first_10 = list(carrier_counts.head(10).index)
empty_list = ["" for i in range(len(carrier_counts)-10)]
n_list = first_10 + empty_list
carrier_counts.plot(kind='pie',figsize=(14,12),radius=2.5, wedgeprops = {'width' : .4}, pctdistance=1.082, labeldistance=1.15,
label='',labels=n_list,
counterclock=False,startangle=90,
explode=[0.08 for i in range(len(carrier_counts))],
autopct="%1.1f%%", colors=sns.color_palette(),fontsize=18.5)#,weight='bold')
plt.title("Delayed Flights proportions per Airlines \nfor the year 2008\n(1,524,733 flights)",
color='#BF1B00', fontsize=30,y=1.04)
plt.axis("equal")
plt.legend(labels=carrier_counts.index,
loc=6,bbox_to_anchor=(-.659,.5),
fontsize=16,prop={'size':20.15});
# cancelled flights proportions per Airlines
#plt.figure(figsize=(15,15))
#plt.pie()
#plt.plot()
carrier_counts = airline_cancelled['Carrier'].value_counts()
first_10 = list(carrier_counts.head(10).index)
empty_list = ["" for i in range(len(carrier_counts)-10)]
n_list = first_10 + empty_list
carrier_counts.plot(kind='pie',figsize=(14,12),radius=2.5, wedgeprops = {'width' : .4}, pctdistance=1.082, labeldistance=1.15,
label='',labels=n_list,
counterclock=False,startangle=90,
explode=[0.08 for i in range(len(carrier_counts))],
autopct="%1.1f%%", colors=sns.color_palette(),fontsize=18.5)#,weight='bold')
plt.title("Cancelled Flights proportions per Airlines \nfor the year 2008\n(137,434 flights)",
color='#BF1B00', fontsize=30,y=1.04)
plt.axis("equal")
plt.legend(labels=carrier_counts.index,
loc=6,bbox_to_anchor=(-.659,.5),
fontsize=16,prop={'size':20.15});
print('\nYear 2008 flights aggregated counts and proportions per airline\n\n\
(sorted by total flights per airline, highest first)')
prop_aireline.sort_values('Flights',ascending=False)
# rows are sorted by each airline's total number of flights
Year 2008 flights aggregated counts and proportions per airline (sorted by total flights per airline, highest first)
| Carrier | Flights | On-time | On-time prop (per airline flights) | Delayed | Del. prop (per airline flights) | Cancelled | Canc. prop (per airline flights) | Diverted | Div. prop (per airline flights) |
|---|---|---|---|---|---|---|---|---|---|
| Southwest Airlines Co. | 1201754 | 967154 | 80.48 % | 219757 | 18.29 % | 12389 | 1.03 % | 2454 | 0.2 % |
| American Airlines Inc. | 604885 | 422461 | 69.84 % | 163024 | 26.95 % | 17440 | 2.88 % | 1960 | 0.32 % |
| Skywest Airlines Inc. | 567159 | 448167 | 79.02 % | 105245 | 18.56 % | 12436 | 2.19 % | 1311 | 0.23 % |
| American Eagle Airlines Inc. | 490693 | 357489 | 72.85 % | 113672 | 23.17 % | 18331 | 3.74 % | 1201 | 0.24 % |
| US Airways Inc. | 453589 | 363153 | 80.06 % | 82933 | 18.28 % | 6582 | 1.45 % | 921 | 0.2 % |
| Delta Air Lines Inc. | 451931 | 345094 | 76.36 % | 98840 | 21.87 % | 6813 | 1.51 % | 1184 | 0.26 % |
| United Air Lines Inc. | 449515 | 321907 | 71.61 % | 116072 | 25.82 % | 10541 | 2.34 % | 995 | 0.22 % |
| Expressjet Airlines Inc. | 374510 | 275531 | 73.57 % | 87883 | 23.47 % | 9992 | 2.67 % | 1104 | 0.29 % |
| Northwest Airlines Inc. | 347652 | 266892 | 76.77 % | 77218 | 22.21 % | 2906 | 0.84 % | 636 | 0.18 % |
| Continental Air Lines Inc. | 298455 | 220808 | 73.98 % | 73047 | 24.48 % | 3702 | 1.24 % | 898 | 0.3 % |
| Atlantic Southeast Airlines | 280575 | 208309 | 74.24 % | 66558 | 23.72 % | 5026 | 1.79 % | 682 | 0.24 % |
| Pinnacle Airlines Inc. | 262208 | 208587 | 79.55 % | 45735 | 17.44 % | 7100 | 2.71 % | 786 | 0.3 % |
| AirTran Airways Corporation | 261684 | 200790 | 76.73 % | 57923 | 22.13 % | 2236 | 0.85 % | 735 | 0.28 % |
| Mesa Airlines Inc. | 254930 | 185981 | 72.95 % | 59150 | 23.2 % | 9219 | 3.62 % | 580 | 0.23 % |
| Comair Inc. | 197607 | 138161 | 69.92 % | 52534 | 26.59 % | 6462 | 3.27 % | 450 | 0.23 % |
| JetBlue Airways | 196091 | 142891 | 72.87 % | 49223 | 25.1 % | 3205 | 1.63 % | 772 | 0.39 % |
| Alaska Airlines Inc. | 151102 | 118302 | 78.29 % | 30190 | 19.98 % | 2139 | 1.42 % | 471 | 0.31 % |
| Frontier Airlines Inc. | 95758 | 75634 | 78.98 % | 19746 | 20.62 % | 303 | 0.32 % | 75 | 0.08 % |
| Hawaiian Airlines Inc. | 61826 | 55623 | 89.97 % | 5589 | 9.04 % | 570 | 0.92 % | 44 | 0.07 % |
| Aloha Airlines Inc. | 7800 | 7358 | 94.33 % | 394 | 5.05 % | 42 | 0.54 % | 6 | 0.08 % |
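The construction of `prop_aireline` is not shown in this part of the notebook. As a rough sketch of how a table of this shape could be assembled, the example below cross-tabulates a toy frame. The `Status` column, the toy carrier names and the output column names are assumptions made for illustration, not the notebook's actual data or method.

```python
import pandas as pd

# Toy stand-in for the flights data: one row per flight.
# The "Status" column is hypothetical; the notebook keeps separate frames instead.
flights = pd.DataFrame({
    "Carrier": ["A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "Status":  ["On-time", "On-time", "Delayed", "Cancelled",
                "On-time", "Delayed", "Delayed", "Diverted", "On-time"],
})

# Count flights per carrier and status, then add the per-carrier total.
counts = pd.crosstab(flights["Carrier"], flights["Status"])
counts.insert(0, "Flights", counts.sum(axis=1))

# Express each status as a percentage of the carrier's own flights.
props = (counts.drop(columns="Flights")
               .div(counts["Flights"], axis=0)
               .mul(100)
               .round(2))

summary = (counts.join(props.add_suffix(" prop (%)"))
                 .sort_values("Flights", ascending=False))
print(summary)
```

Keeping the proportions numeric (rather than strings ending in `' %'`) lets the table be sorted or plotted by any proportion column later.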
# Airlines cancelled flights
sns.set(font_scale=1.19,style='white')
plt.figure(figsize=(8,15))
sns.countplot(data = airline_cancelled ,
y = 'Carrier',
order = airline_cancelled['Carrier'].value_counts().index,
color = sns.color_palette()[7])
sns.despine(bottom=True)
plt.title('Airlines cancelled flights counts (from highest) and proportions \n for the year 2008\n [Total 137,434]',
fontsize=21,color='#AD1FFF',y=1.03)#,x=.205
plt.xlabel('')
plt.ylabel('Carrier',weight='bold',fontsize=18)
plt.xticks([])
plt.yticks(color = sns.color_palette()[1],fontsize=12.5)
# Recalculating the flights_counts for carriers
carrier_flights_count = airline_cancelled['Carrier'].value_counts()
carrier_flights_count_whole = df_inv['Carrier'].value_counts() # to be used in calculating percentages of carrier flights
# cancelled-flight counts
# against carrier total flights number
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct count
    count = carrier_flights_count[label.get_text()]
    count_whole = carrier_flights_count_whole[label.get_text()]
    percentage_value = 100*count/count_whole
    percentage_total = 100*count/137434
    string = str(f"{count:,}") + " ({:0.2f}% of carrier's)\n{:0.2f}%".format(percentage_value,percentage_total)
    # print the annotation next to the bar
    plt.text(count+700, loc, s=string, va = 'center', color = 'black');#,fontsize=12.5);
print('\nYear 2008 cancelled flights aggregated data per airline\n\n(sorted by total flights per airline, highest first)')
prop_aireline.sort_values('Flights',ascending=False).iloc[:,[0,5,6]]
# rows are sorted by each airline's total number of flights
Year 2008 cancelled flights aggregated data per airline (sorted by total flights per airline, highest first)
| Carrier | Flights | Cancelled | Canc. prop (per airline flights) |
|---|---|---|---|
| Southwest Airlines Co. | 1201754 | 12389 | 1.03 % |
| American Airlines Inc. | 604885 | 17440 | 2.88 % |
| Skywest Airlines Inc. | 567159 | 12436 | 2.19 % |
| American Eagle Airlines Inc. | 490693 | 18331 | 3.74 % |
| US Airways Inc. | 453589 | 6582 | 1.45 % |
| Delta Air Lines Inc. | 451931 | 6813 | 1.51 % |
| United Air Lines Inc. | 449515 | 10541 | 2.34 % |
| Expressjet Airlines Inc. | 374510 | 9992 | 2.67 % |
| Northwest Airlines Inc. | 347652 | 2906 | 0.84 % |
| Continental Air Lines Inc. | 298455 | 3702 | 1.24 % |
| Atlantic Southeast Airlines | 280575 | 5026 | 1.79 % |
| Pinnacle Airlines Inc. | 262208 | 7100 | 2.71 % |
| AirTran Airways Corporation | 261684 | 2236 | 0.85 % |
| Mesa Airlines Inc. | 254930 | 9219 | 3.62 % |
| Comair Inc. | 197607 | 6462 | 3.27 % |
| JetBlue Airways | 196091 | 3205 | 1.63 % |
| Alaska Airlines Inc. | 151102 | 2139 | 1.42 % |
| Frontier Airlines Inc. | 95758 | 303 | 0.32 % |
| Hawaiian Airlines Inc. | 61826 | 570 | 0.92 % |
| Aloha Airlines Inc. | 7800 | 42 | 0.54 % |
#prop['Canc. prop (per total cancelled-flights 137,434 of year 2008'] = round(prop['Cancelled']*100/137434 ,
# 2).astype(str) + ' %'
#prop.drop(columns='Canc. prop (per total cancelled-flights 137,434 of year 2008',inplace=True)
print('\nAirlines cancelled-flights proportions compared to all 137,434 total cancelled-flights of year 2008 \n\n\
(sorted from the highest proportion)')
col = 'Canc. prop (per total 137,434 cancelled-flights of year 2008)'
round(prop_aireline.sort_values('Flights',ascending=False).iloc[:,[0,5,6]]['Cancelled']*100/137434 ,
2) \
.to_frame(col) \
.sort_values(col,ascending=False).astype(str) + ' %'
# rows are sorted by each airline's share of all cancelled flights, highest first
Airlines cancelled-flights proportions compared to all 137,434 total cancelled-flights of year 2008 (sorted from the highest proportion)
| Carrier | Canc. prop (per total 137,434 cancelled-flights of year 2008) |
|---|---|
| American Eagle Airlines Inc. | 13.34 % |
| American Airlines Inc. | 12.69 % |
| Skywest Airlines Inc. | 9.05 % |
| Southwest Airlines Co. | 9.01 % |
| United Air Lines Inc. | 7.67 % |
| Expressjet Airlines Inc. | 7.27 % |
| Mesa Airlines Inc. | 6.71 % |
| Pinnacle Airlines Inc. | 5.17 % |
| Delta Air Lines Inc. | 4.96 % |
| US Airways Inc. | 4.79 % |
| Comair Inc. | 4.7 % |
| Atlantic Southeast Airlines | 3.66 % |
| Continental Air Lines Inc. | 2.69 % |
| JetBlue Airways | 2.33 % |
| Northwest Airlines Inc. | 2.11 % |
| AirTran Airways Corporation | 1.63 % |
| Alaska Airlines Inc. | 1.56 % |
| Hawaiian Airlines Inc. | 0.41 % |
| Frontier Airlines Inc. | 0.22 % |
| Aloha Airlines Inc. | 0.03 % |
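The two cancellation tables answer different questions: the share of all 137,434 cancellations that belongs to an airline, and the rate of cancellation within that airline's own flights. The short sketch below recomputes both for a few airlines, using only counts printed in the tables above; the variable names are illustrative.

```python
TOTAL_CANCELLED = 137_434  # all cancelled flights in 2008, from the tables above

# (cancelled flights, total flights) per airline, copied from the tables above
airlines = {
    "American Eagle Airlines Inc.": (18_331, 490_693),
    "American Airlines Inc.": (17_440, 604_885),
    "Southwest Airlines Co.": (12_389, 1_201_754),
    "Aloha Airlines Inc.": (42, 7_800),
}

# Share of all cancellations vs. cancellation rate within the airline
share_of_total = {name: round(100 * c / TOTAL_CANCELLED, 2)
                  for name, (c, _) in airlines.items()}
own_rate = {name: round(100 * c / n, 2)
            for name, (c, n) in airlines.items()}

for name in airlines:
    print(f"{name}: {share_of_total[name]} % of all cancellations, "
          f"{own_rate[name]} % of its own flights")
```

Southwest Airlines Co. illustrates the difference: it accounts for about 9 % of all cancellations because it flies the most, yet only about 1 % of its own flights were cancelled.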
# cancellation causes per airline
plt.figure(figsize=(12,15))
ax = sns.countplot(data=airline_cancelled,
y='Carrier',order = airline_cancelled['Carrier'].value_counts().index,
hue='CancellationCode',hue_order=['A','B','C','D'],
palette='bright');
sns.despine(top=False,bottom=True)
title = 'Flights cancellation-causes frequencies by airlines \nfor the year 2008'
plt.title(title + '\n(from highest total cancelled flights per airline)',
fontsize=21,color='#BF1B00',weight='bold',y=1.14)
plt.xlabel('')
plt.ylabel('Carrier',fontsize=18,weight='bold')
ax.xaxis.tick_top()
plt.xticks(np.arange(0,12000,2000),
[0] + [str(n)+'k' for n in range(2,11,2)],
#[0,'2k','4k','6k','8k','10k'],
fontsize=14,weight='bold')
plt.yticks(fontsize=14,weight='bold')
plt.legend(title='Cancellation causes',title_fontsize=18,
loc='upper center',#loc='upper right',
#edgecolor='black', #prop={'size': 22},
labels=['Carrier', 'Weather', 'National Air System', 'Security'],ncol = 4,
#bbox_to_anchor=(1.2,.66),
bbox_to_anchor=(.5,1.12),framealpha = .8,fontsize=14,prop={'size':16},columnspacing=.8)
# drawing max and min lines
max_min_lines = {'carrier':(9609,42),
'weather':(8408,0),
'NAS':(6275,0)}
colors = (color for color in list(sns.color_palette()))
for m in max_min_lines:
    # avoid shadowing the built-in max() and min()
    max_count,min_count = max_min_lines[m]
    color=next(colors)
    plt.axvline(max_count,color=color,linewidth=1.7, linestyle ='-')
    plt.axvline(min_count ,color=color,linewidth=1.7, linestyle ='-');
#'-', '--', '-.', ':', 'None', ' ', '', 'solid', 'dashed', 'dashdot', 'dotted'
print('\nHint: Colored lines indicate the Max and Min frequencies');
#plt.grid(True,axis='x');
Hint: Colored lines indicate the Max and Min frequencies
# # univariate plots of CancellationCode or flight cancellation-causes codes within Carriers
# for the whole data points df_inv (7,009,724 rows)
# A = carrier, B = weather, C = NAS, D = security
sns.set(font_scale=1.19,style='whitegrid')
g = sns.FacetGrid(data=df_inv,col='Carrier'
,col_wrap=4
)#,xlim=(0,300))
g.fig.suptitle('Flights-cancellation causes frequencies per Carriers\nfor year 2008 flights\n',fontsize=18)
order = df_inv['CancellationCode'].value_counts().sort_index().index
g.map(sns.countplot,'CancellationCode', order = order,
hue=df_inv['CancellationCode'],hue_order= order,
color = sns.color_palette()[0],palette='muted')
g.set_titles('{col_name}')
g.add_legend(labels=['Carrier', 'Weather', 'National Air System', 'Security'])
g.set_ylabels('Freq.');#,color='blue',fontsize=15);
yticklabels = [str(n/1000) + 'k' for n in g.axes[0].get_yticks()]
g.set_yticklabels(labels = [0] +yticklabels[1:],fontsize=15.5)
g.set_xlabels('Cancellation Causes')
g.set_xticklabels([cause.replace('\n',' ') for cause in causes],
rotation=90);
# flights cancellation causes times per airlines
print('\nFlights cancellation-causes times per Airlines\n\n(Sorted from highest total cancellation times)')
airline_cancelled_dummies = pd.get_dummies(data=airline_cancelled,
columns=['CancellationCode'],
prefix='',
prefix_sep='')
airline_cancelled_dummies['Total flights-cancellation causes times'] = \
airline_cancelled_dummies['A'] + airline_cancelled_dummies['B'] + \
airline_cancelled_dummies['C'] + airline_cancelled_dummies['D']
airline_cancelled_dummies.drop(columns='Month',inplace=True)
airline_cancelled_dummies.rename(columns={'A':'Carrier cancellation',
'B':'Weather cancellation',
'C':'National Air system cancellation',
'D':'Security cancellation'},
inplace=True)
airline_cancelled_dummies.groupby('Carrier').sum()\
.sort_values(by='Total flights-cancellation causes times',ascending=False)\
#.iloc[:,[0,1,2,3]]\
#.plot(kind='barh',figsize=(15,15));
Flights cancellation-causes times per Airlines (Sorted from highest total cancellation times)
| Carrier | Carrier cancellation | Weather cancellation | National Air system cancellation | Security cancellation | Total flights-cancellation causes times |
|---|---|---|---|---|---|
| American Eagle Airlines Inc. | 3644.0 | 8408.0 | 6275.0 | 4.0 | 18331.0 |
| American Airlines Inc. | 9609.0 | 5495.0 | 2334.0 | 2.0 | 17440.0 |
| Skywest Airlines Inc. | 3162.0 | 4748.0 | 4525.0 | 1.0 | 12436.0 |
| Southwest Airlines Co. | 6554.0 | 5167.0 | 666.0 | 2.0 | 12389.0 |
| United Air Lines Inc. | 6352.0 | 2987.0 | 1202.0 | 0.0 | 10541.0 |
| Expressjet Airlines Inc. | 697.0 | 5424.0 | 3871.0 | 0.0 | 9992.0 |
| Mesa Airlines Inc. | 5020.0 | 1766.0 | 2433.0 | 0.0 | 9219.0 |
| Pinnacle Airlines Inc. | 4329.0 | 1932.0 | 838.0 | 1.0 | 7100.0 |
| Delta Air Lines Inc. | 3113.0 | 2283.0 | 1417.0 | 0.0 | 6813.0 |
| US Airways Inc. | 3857.0 | 1131.0 | 1594.0 | 0.0 | 6582.0 |
| Comair Inc. | 1328.0 | 5016.0 | 118.0 | 0.0 | 6462.0 |
| Atlantic Southeast Airlines | 1384.0 | 2283.0 | 1357.0 | 2.0 | 5026.0 |
| Continental Air Lines Inc. | 746.0 | 2878.0 | 78.0 | 0.0 | 3702.0 |
| JetBlue Airways | 665.0 | 2445.0 | 95.0 | 0.0 | 3205.0 |
| Northwest Airlines Inc. | 1119.0 | 901.0 | 886.0 | 0.0 | 2906.0 |
| AirTran Airways Corporation | 815.0 | 976.0 | 445.0 | 0.0 | 2236.0 |
| Alaska Airlines Inc. | 1222.0 | 865.0 | 52.0 | 0.0 | 2139.0 |
| Hawaiian Airlines Inc. | 563.0 | 7.0 | 0.0 | 0.0 | 570.0 |
| Frontier Airlines Inc. | 109.0 | 192.0 | 2.0 | 0.0 | 303.0 |
| Aloha Airlines Inc. | 42.0 | 0.0 | 0.0 | 0.0 | 42.0 |
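The counts above come out as floats (for example `3644.0`) because of the dummy-column sum. As an alternative sketch, `pd.crosstab` yields integer counts directly, and the per-cause maximum and minimum used for the reference lines in the earlier plot can be derived rather than typed in by hand. The toy frame and its carrier names are invented for illustration; only the code-to-cause mapping comes from the notebook.

```python
import pandas as pd

# Toy stand-in for airline_cancelled (carrier names are made up)
cancelled = pd.DataFrame({
    "Carrier": ["X Air", "X Air", "X Air", "Y Jet", "Y Jet", "Z Fly"],
    "CancellationCode": ["A", "B", "B", "A", "C", "A"],
})
# A = carrier, B = weather, C = NAS, D = security (as in the notebook)
code_names = {"A": "Carrier", "B": "Weather",
              "C": "National Air System", "D": "Security"}

cause_counts = (
    pd.crosstab(cancelled["Carrier"], cancelled["CancellationCode"])
      .reindex(columns=list(code_names), fill_value=0)  # keep codes with no rows
      .rename(columns=code_names)
)
cause_counts["Total"] = cause_counts.sum(axis=1)
cause_counts = cause_counts.sort_values("Total", ascending=False)

# Highest and lowest count per cause across carriers
extremes = cause_counts.drop(columns="Total").agg(["max", "min"])
print(cause_counts)
print(extremes)
```

The `reindex` step keeps a column for every cancellation code even when a code never occurs in the data, so the table shape stays stable.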
# ArrDelay means per Airlines
# airlines_arr_means = df_inv.loc[~(df_inv['ArrDelay'].isin(zero_or_null)),['Carrier','ArrDelay']] \
# .groupby('Carrier').mean() \
# .sort_values(by='ArrDelay',ascending=False)
airlines_arr_means = df_inv[~(df_inv['ArrDelay'].isin(zero_or_null))][['Carrier','ArrDelay']]\
.groupby('Carrier').mean()\
.sort_values(by='ArrDelay',ascending=False)
plt.figure(figsize=[10,12])
sns.barplot(data=airlines_arr_means, y=airlines_arr_means.index , x='ArrDelay')#,linestyles='')
sns.despine(left=True,bottom=True)
sns.set_theme(font_scale=1.19,style='ticks')
plt.suptitle('Arrival delays (15 min. or more) means per airlines \n for the year 2008 flights',fontsize=21,weight='bold')
plt.ylabel('Airline',fontsize=18,weight='bold')
plt.title('Arrival delay mean (min.)',fontsize=18,weight='bold')
plt.xticks([])
plt.xlabel('')
#plt.ylim((0,18))
#plt.grid(True)
loc , labels = plt.xticks(fontsize=13.5)
loc , labels = plt.yticks(fontsize=13.5)
plt.axvline(color='black')
# get the current tick locations and labels
locs, labels = plt.yticks(weight='bold')
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # get the text property for the label to get the correct count
    mean = airlines_arr_means.loc[label.get_text(),'ArrDelay']
    string = '{:0.2f} min.'.format(mean)
    # print the annotation next to the bar
    plt.text(mean+.3, loc, s=string, va = 'center', color = 'black',fontsize=13,weight='bold');
In general, arrival delays due to 'Weather' had the highest medians within almost all carriers, together with the widest range between the median and the third quartile and more high-value delays; this is most notable for 'JetBlue Airways'. The exception is 'Frontier Airlines Inc.', where arrival delays due to 'Security' had the highest median, the widest range between the first quartile and the median, and more high-value delays.
Also in general, arrival delays due to 'Carrier' and 'NAS' (National Air System) alternated between the second and third highest medians and had distributions concentrated at lower values. The exceptions are a few carriers where arrival delays due to 'Security' may have had greater medians and more high-value delays, such as 'United Air Lines Inc.', 'Delta Air Lines Inc.' and 'Aloha Airlines Inc.'.
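The medians and quartile ranges described above are read from boxplots. A minimal sketch of how the same figures could be tabulated per carrier and cause is given here; it runs on invented toy delays, and the frame name `delays` and the derived column names are assumptions, with only `Carrier`, `Cause` and `ArrDelay` taken from `melt_15`.

```python
import pandas as pd

# Toy stand-in for melt_15 (values are invented)
delays = pd.DataFrame({
    "Carrier": ["X Air"] * 8,
    "Cause": ["Weather"] * 4 + ["Carrier"] * 4,
    "ArrDelay": [20, 40, 60, 100, 15, 20, 25, 40],
})

# First quartile, median and third quartile per carrier and cause
quartiles = (
    delays.groupby(["Carrier", "Cause"])["ArrDelay"]
          .quantile([0.25, 0.5, 0.75])
          .unstack()
          .rename(columns={0.25: "Q1", 0.5: "median", 0.75: "Q3"})
)
# Width of the box below and above the median
quartiles["Q1_to_median"] = quartiles["median"] - quartiles["Q1"]
quartiles["median_to_Q3"] = quartiles["Q3"] - quartiles["median"]
print(quartiles)
```

Sorting this table by `median` or `median_to_Q3` within each carrier gives a numeric check on which cause has the highest median or the widest upper box.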
plt.figure(figsize = [15,8])
#plt.suptitle('Arrival delays means by delay causes \n per Airlines\n for the year 2008',fontsize=21,weight='bold')
ax = sns.pointplot(data = melt_15, x = 'Carrier', y = 'ArrDelay', hue = 'Cause',
order=airlines_arr_means.index[::-1],
linestyles='dotted',dodge= 0.237,scale=1.15
,palette = 'rocket_r')
title='Arrival delays (15 min. or more) means by delay-causes\n'
ax.set_title(title+ 'per Airlines (from lowest total arrival delay mean)\n for the year 2008',
fontsize=21,weight='bold',y=1.165)
ax.set_xlabel('Airline',color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',y=-.2)
ax.set_ylabel('Arrival delay mean (min.)', color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',x=-.2)
plt.xticks(fontsize=15,weight='bold',rotation=90)
plt.yticks(fontsize=15,weight='bold')
#ax.xaxis.tick_top()
legend = ax.legend(title = 'Delay causes', ncol = 4, framealpha = .8,
loc='upper center',bbox_to_anchor=(.5,1.16),
# labels=causes,
fontsize=14,prop={'size':16},columnspacing=.8)
plt.setp(legend.get_title(),fontsize='x-large')#,color=sns.color_palette('viridis')[-3])
sns.despine(left=True,bottom=True)
plt.grid(True);
# ArrDelay means by causes per Carriers for flights with 15+ minutes arrival delay
sns.set(font_scale=1.44,style='whitegrid')
g = sns.FacetGrid(data=melt_15,col='Carrier',
col_order=airlines_arr_means.index[::-1],
col_wrap=5,
height=4)#,xlim=(0,300))
title='Arrival Delay (15+ minutes) means by delay causes per Carriers (from lowest total arrival delay mean)\n'
g.fig.suptitle(title+'for year 2008 flights',
fontsize=24,x=0.331,y=1.06)
g.map(sns.pointplot,'Cause','ArrDelay', order=melt_15.Cause.unique(),
hue=melt_15['Cause'],hue_order=melt_15.Cause.unique(),
palette = 'rocket_r',
dodge=.7)#,linestyles='',scale=1.15,
g.set_titles('{col_name}',weight='bold')
g.add_legend(title='Delay causes',#title_fontsize =25,
loc='upper center',
ncol=4,
prop={'size':20},
bbox_to_anchor=(0.331,1.015),
frameon=True,framealpha=.8)
plt.setp(g._legend.get_title(),fontsize=22)
g.set_xticklabels(rotation=90)#,fontsize=13.5)
#g.set_yticklabels(g.axes[0].get_yticks().astype('int'),fontsize=13.5)
g.set_xlabels('Cause',weight='bold')
g.set_ylabels('Arrival delay mean (min.)',weight='bold');#,color='blue',fontsize=15);
print('\nFlight delay means per airline and cause')
columns={'ArrDelay':'Arrival delay mean (min.)','Minutes':'Cause delay mean (min.)'}
melt_15.groupby(['Carrier','Cause']).mean().rename(columns=columns)
#melt_15[list(set(melt_15.columns) -{'Cancelled'})].groupby(['Carrier','Cause']).mean().rename(columns=columns)
Flight delay means per airline and cause
| Carrier | Cause | Arrival delay mean (min.) | Cause delay mean (min.) |
|---|---|---|---|
| AirTran Airways Corporation | Carrier | 42.115972 | 31.822994 |
| | National Air System | 53.588426 | 26.499704 |
| | Weather | 76.769874 | 51.468619 |
| Alaska Airlines Inc. | Carrier | 56.899871 | 38.939147 |
| | National Air System | 46.367842 | 19.703027 |
| ... | ... | ... | ... |
| US Airways Inc. | Weather | 85.449527 | 40.166798 |
| United Air Lines Inc. | Carrier | 66.042223 | 36.718466 |
| | National Air System | 59.105156 | 29.436918 |
| | Security | 49.883721 | 38.627907 |
| | Weather | 91.000000 | 41.386162 |
79 rows × 2 columns
plt.figure(figsize = [15,10])
#plt.suptitle('Arrival delays means by delay causes \n per Airlines\n for the year 2008',fontsize=21,weight='bold')
ax = sns.boxplot(data = melt_15, x = 'Carrier', y = 'ArrDelay', hue = 'Cause',
order=airlines_arr_means.index[::-1],
dodge= 0.2,#scale=1.15 ,linestyles='',
palette = 'rocket_r')
title = 'Arrival delay (15 min. or more) by delay-causes boxplots\n per Airlines (from lowest total arrival delay mean)'
ax.set_title(title+'\nfor the year 2008',fontsize=21,weight='bold',y=1.18)
ax.set_xlabel('Airline',color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',y=-.2)
ax.set_ylabel('Arrival delay (min.)', color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',x=-.2)
plt.xticks(fontsize=15,weight='bold',rotation=90)
plt.yticks(np.arange(0,3000,500),
[0] + [str(n)+'k' for n in np.arange(0.5,3,0.5)],
#[0] + [str(n/10)+'k' for n in range(5,30,5)],
fontsize=15,weight='bold')
#ax.xaxis.tick_top()
legend = ax.legend(title = 'Delay causes', ncol = 4, framealpha = .8,
loc='upper center',bbox_to_anchor=(.5,1.16),
# labels=causes,
fontsize=14,prop={'size':16},columnspacing=.8)
plt.setp(legend.get_title(),fontsize='x-large', color=sns.color_palette('viridis')[-3])
sns.despine(left=True,bottom=True)
plt.grid(True);
plt.figure(figsize = [18,10])
#plt.suptitle('Arrival delays means by delay causes \n per Airlines\n for the year 2008',fontsize=21,weight='bold')
ax = sns.boxplot(data = melt_15, x = 'Carrier', y = 'ArrDelay', hue = 'Cause',
order=airlines_arr_means.index[::-1],
dodge= 0.2,#scale=1.15 ,linestyles='',
palette = 'rocket_r')
title = 'Arrival delay (15 min. or more) by delay-causes boxplots\n per Airlines (from lowest total arrival delay mean)'
ax.set_title(title +'\nfor the year 2008'+ '\n(limiting to 200 min.)',
fontsize=21,weight='bold',y=1.18)
ax.set_xlabel('Airline',color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',y=-.2)
ax.set_ylabel('Arrival delay (min.)', color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',x=-.2)
plt.ylim(0,200)
plt.xticks(fontsize=15,weight='bold',rotation=90)
plt.yticks(fontsize=15,weight='bold')
#ax.xaxis.tick_top()
legend = ax.legend(title = 'Delay causes', ncol = 4, framealpha = .8,
loc='upper center',bbox_to_anchor=(.5,1.16),
# labels=causes,
fontsize=14,prop={'size':16},columnspacing=.8)
plt.setp(legend.get_title(),fontsize='x-large',color=sns.color_palette('viridis')[-3])
sns.despine(left=True,bottom=True)
plt.grid(True);
# # bivariate plots of Arrival-delays (15+ min., limited to 200) boxplots per delay causes by Carriers for year 2008 flights
# 1.5+ million flights, 20 Carriers
sns.set(font_scale=1.19,style='whitegrid')
order = melt_15['Cause'].unique()
# g = sns.FacetGrid(data=melt_15,col='Month',
# col_wrap=4,
# hue='Cause',hue_order= order,palette='pastel'
# )#,xlim=(0,300))
g = sns.catplot(kind='box',
data=melt_15,col='Carrier', col_order=airlines_arr_means.index[::-1],
col_wrap=5,
x='Cause',order=order,
y='ArrDelay',
hue='Cause',hue_order= order,palette='pastel',#'rocket_r',
height=3.5,facet_kws=dict(ylim=(0,200))
)
title = 'Arrival-delays (15+ min., limited to 200) boxplots by delay causes'
g.fig.suptitle(title+'\nper Carriers (from lowest total arrival delay mean)'+'\nfor year 2008 flights',
color='#BF1B00',
fontsize=21,
x=0.35,
y=1.15)
# g.map(sns.barplot,'Cause','Minutes',order=order,)
g.set_titles('{col_name}',fontsize=18,weight='bold')
g.add_legend(title = 'Delay causes', #labels=['Carrier', 'Weather', 'National Air System', 'Security'])
loc='upper center',
ncol=4,
fontsize=15,
bbox_to_anchor=(0.35,1.073),
frameon=True,framealpha=.8)
plt.setp(g._legend.get_title(),fontsize=16)
g.set_xlabels('Cause',weight='bold',fontsize=14)#
g.set_xticklabels([cause.replace('\n',' ') for cause in causes],
rotation=90,fontsize=13.7)#,weight='bold'
g.set_ylabels('Arrival delay (min.)',weight='bold',fontsize=14);
ytickslabels = (g.axes[0].get_yticks()).astype('int')
#ytickslabels = [0] + [str(n/1000)+'k' for n in ytickslabels[1:]]
g.set_yticklabels(ytickslabels,fontsize=13.7);#fontsize=12,color='blue',weight='bold';
plt.figure(figsize = [15,10])
#plt.suptitle('Arrival delays means by delay causes \n per Airlines\n for the year 2008',fontsize=21,weight='bold')
ax = sns.violinplot(data = melt_15, x = 'Carrier', y = 'ArrDelay', hue = 'Cause',
#order=airlines_arr_means.index,
dodge= 0.2,#scale=1.15 ,linestyles='',
)#palette = 'rocket_r')
title = 'Arrival delay (15 min. or more) by delay-causes violinplots\n per Airlines' # (from highest total arrival delay mean)'
ax.set_title(title+'\nfor the year 2008',fontsize=21,weight='bold',y=1.18)
ax.set_xlabel('Airline',color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',y=-.2)
ax.set_ylabel('Arrival delay (min.)', color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',x=-.2)
plt.xticks(fontsize=15,weight='bold',rotation=90)
plt.yticks(np.arange(0,3000,500),
[0] + [str(n)+'k' for n in np.arange(0.5,3,0.5)],
#[0] + [str(n/10)+'k' for n in range(5,30,5)],
fontsize=15,weight='bold')
#ax.xaxis.tick_top()
legend = ax.legend(title = 'Delay causes', ncol = 4, framealpha = .8,
loc='upper center',bbox_to_anchor=(.5,1.16),
# labels=causes,
fontsize=14,prop={'size':16},columnspacing=.8)
plt.setp(legend.get_title(),fontsize='x-large', color=sns.color_palette('viridis')[-3])
sns.despine(left=True,bottom=True)
plt.grid(True);
plt.figure(figsize = [18,10])
#plt.suptitle('Arrival delays means by delay causes \n per Airlines\n for the year 2008',fontsize=21,weight='bold')
ax = sns.violinplot(data = melt_15, x = 'Carrier', y = 'ArrDelay', hue = 'Cause',
order=airlines_arr_means.index[::-1],
dodge= 0.2,#scale=1.15 ,linestyles='',
)#palette = 'rocket_r')
title = 'Arrival delay (15 min. or more) by delay-causes violinplots\n per Airlines (from lowest total arrival delay mean)'
ax.set_title(title + '\n for the year 2008' + '\n(limiting to 200 min.)',
fontsize=21,weight='bold',y=1.18)
ax.set_xlabel('Airline',color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',y=-.2)
ax.set_ylabel('Arrival delay (min.)', color=sns.color_palette('viridis')[-3],
fontsize=18,weight='bold',x=-.2)
plt.ylim(-75,200)
plt.xticks(fontsize=15,weight='bold',rotation=90)
plt.yticks(fontsize=15,weight='bold')
#ax.xaxis.tick_top()
legend = ax.legend(title = 'Delay causes', ncol = 4, framealpha = .8,
loc='upper center',bbox_to_anchor=(.5,1.16),
# labels=causes,
prop={'size':16},columnspacing=.8)  # prop sets the label size; fontsize is ignored when prop is given
plt.setp(legend.get_title(),fontsize='x-large',color=sns.color_palette('viridis')[-3])
sns.despine(left=True,bottom=True)
plt.grid(True);
# Bivariate violin plots of arrival delays (15+ min., y-axis limited to 175) per delay cause, faceted by carrier, for year 2008 flights
# 1.5+ million flights, 20 carriers
sns.set(font_scale=1.19,style='whitegrid')
order = melt_15['Cause'].unique()
# g = sns.FacetGrid(data=melt_15,col='Month',
# col_wrap=4,
# hue='Cause',hue_order= order,palette='pastel'
# )#,xlim=(0,300))
g = sns.catplot(kind='violin',
data=melt_15,col='Carrier', col_order=airlines_arr_means.index[::-1],
col_wrap=5,
x='Cause',order=order,
y='ArrDelay',
hue='Cause',hue_order= order,palette='pastel',#'rocket_r',
height=3.5,facet_kws=dict(ylim=(0,175))
)
title = 'Arrival-delays (15+ min., limited to 175) violinplots by delay causes'
g.fig.suptitle(title+'\nper Carriers (from lowest total arrival delay mean)'+'\nfor year 2008 flights',
color='#BF1B00',
fontsize=21,
x=0.35,
y=1.15)
# g.map(sns.barplot,'Cause','Minutes',order=order,)
g.set_titles('{col_name}',fontsize=18,weight='bold')
g.add_legend(title = 'Delay causes', #labels=['Carrier', 'Weather', 'National Air System', 'Security'])
loc='upper center',
ncol=4,
fontsize=15,
bbox_to_anchor=(0.35,1.073),
frameon=True,framealpha=.8)
plt.setp(g._legend.get_title(),fontsize=16)
g.set_xlabels('Cause',weight='bold',fontsize=14)#
g.set_xticklabels([cause.replace('\n',' ') for cause in causes],
rotation=90,fontsize=13.7)#,weight='bold'
g.set_ylabels('Arrival delay (min.)',weight='bold',fontsize=14);
ytickslabels = (g.axes[0].get_yticks()).astype('int')
#ytickslabels = [0] + [str(n/1000)+'k' for n in ytickslabels[1:]]
g.set_yticklabels(ytickslabels,fontsize=13.7);#fontsize=12,color='blue',weight='bold';
1**st**: **related to arrival delays and delay causes in general**⮚
➷ 'Carrier' and 'Weather' were the most common causes of flight cancellation.
➷ Flight cancellations due to 'Security' do not appear to be common.
➷ Most arrival delays ranged from 15 minutes to about 78 minutes.
➷ Low delay values occurred more frequently for 'Carrier' and 'National Air System' (NAS) delays than for the other delay causes.
➷ 'Carrier' delays stood out with the highest spread and the largest outliers.
➷ 'NAS' delays were concentrated more at the lower values.
➷ In general, 'Weather' delays had the highest mean and median, the widest IQR, and a wider range from the median to the third quartile than delays from the other causes.
➷ Flights with weather delays were likely to have the longest arrival delays overall.
➷ Arrival delays due to 'Weather' had a higher mean and median (about 62 minutes), a larger third quartile range, and a wider distribution of large values, from about 55 minutes to about 225 minutes, than arrival delays due to other causes.
➷ Arrival delays due to 'NAS' were far more frequent.
➷ Across all flights with a 15+ min. arrival delay, arrival delays due to 'Security' generally had the lowest median and ranges, as well as the lowest outliers.
➷ For all delayed flights (15+ min. arrival delay), there was a roughly linear relationship between the arrival delay and the delay attributed to each cause, up to some point below 250 minutes of cause delay (about 50 minutes for the sampled data).
➷ In general, arrival delays had their strongest correlation, a moderate one, with carrier delays.
➷ 'Carrier' and 'NAS' delays correlated more strongly with arrival delays than 'Weather' delays did.
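The per-cause comparisons above (mean, median, IQR) can be checked with a grouped summary. This is a minimal sketch, assuming a long-format table with 'Cause' and 'ArrDelay' columns like the melt_15 frame used in the plots. The toy values and the `iqr` helper are made up for illustration and are not results from the 2008 data.

```python
import pandas as pd

# Toy stand-in for the long-format delays table (hypothetical values).
toy = pd.DataFrame({
    'Cause': ['Carrier'] * 3 + ['Weather'] * 3 + ['NAS'] * 3 + ['Security'] * 3,
    'ArrDelay': [20, 35, 140, 40, 62, 225, 16, 25, 60, 15, 18, 30],
})

def iqr(s):
    """Interquartile range: third quartile minus first quartile."""
    return s.quantile(0.75) - s.quantile(0.25)

# One row per cause, with the statistics the findings refer to.
cause_stats = toy.groupby('Cause')['ArrDelay'].agg(
    mean='mean', median='median', iqr=iqr, max='max'
)
print(cause_stats.sort_values('median', ascending=False))
```

Sorting by 'median' puts the cause with the typically longest delays first, which is the comparison the violin plots show visually.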
2**nd**: **related to 'Months'**⮚
➷ All months of 2008 had similar shares of the total recorded flights.
➷ 'Feb', 'Dec', 'Jan', and 'Mar' had the highest flight-cancellation counts.
➷ 'May', 'Nov', and 'Oct' had the lowest flight-cancellation counts.
➷ 'Feb' had the highest ratio of cancelled flights, followed by 'Dec' and 'Jan'; all three are winter months.
➷ 'Weather' had the highest impact on flight cancellations in 'Feb', 'Dec', 'Jan', 'Mar', and 'Sep', almost half of the year.
➷ 'Dec', 'Jun', 'Feb', and 'Mar' had the highest mean arrival delays.
➷ 'Sep' and 'Nov' had the lowest mean arrival delays.
➷ 'Weather' had the highest impact on arrival delays across the months of 2008.
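The difference between cancellation counts and cancellation ratios per month can be sketched as follows. This assumes a flights table with a 'Month' column and a 0/1 'Cancelled' flag. The toy rows are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Toy flights table (hypothetical rows): Cancelled is 1 for a cancelled flight.
toy = pd.DataFrame({
    'Month': ['Jan'] * 4 + ['Feb'] * 4 + ['May'] * 4,
    'Cancelled': [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
})

# Count all flights and cancelled flights per month, then take the ratio.
monthly = toy.groupby('Month')['Cancelled'].agg(flights='count', cancelled='sum')
monthly['cancel_ratio'] = monthly['cancelled'] / monthly['flights']
monthly = monthly.sort_values('cancel_ratio', ascending=False)
print(monthly)
```

Because the months have similar flight totals, the ranking by ratio is expected to stay close to the ranking by raw counts.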
3**rd**: **related to 'Carriers'**⮚
➷ 'Southwest Airlines Co.' recorded the highest total flight count in 2008, more than twice that of the next carrier.
➷ 'American Eagle Airlines Inc.', 'American Airlines Inc.', 'Skywest Airlines Inc.', 'Southwest Airlines Co.', 'United Air Lines Inc.', and 'Expressjet Airlines Inc.' had the highest flight-cancellation counts.
➷ 'Hawaiian Airlines Inc.', 'Frontier Airlines Inc.', and 'Aloha Airlines Inc.' had the lowest flight-cancellation counts.
➷ 'Carrier' procedures had the highest impact on flight cancellations for 11 of the 20 carriers.
➷ 'Weather' had the highest impact on flight cancellations for 9 of the 20 carriers.
➷ 'NAS' (National Air System) had the second-highest impact on flight cancellations for 5 of the 20 carriers.
➷ 'American Airlines Inc.' had more cancellations due to 'Carrier'.
➷ Most flight cancellations of 'Hawaiian Airlines Inc.' were due to 'Carrier'.
➷ The 42 cancelled flights of 'Aloha Airlines Inc.' were due to 'Carrier'.
➷ 'American Airlines Inc.', 'Mesa Airlines Inc.', 'Comair Inc.', 'United Air Lines Inc.', 'JetBlue Airways', and 'Continental Air Lines Inc.' had the highest mean arrival delays.
➷ 'Pinnacle Airlines Inc.', 'US Airways Inc.', and 'Hawaiian Airlines Inc.' had the lowest mean arrival delays.
➷ The mean arrival delay of 'Aloha Airlines Inc.' was about -3 minutes, indicating almost no arrival delays.
➷ On average, 'Weather' had the highest impact on arrival delays for almost all carriers, notably 'JetBlue Airways'; 'Frontier Airlines Inc.' was the exception.
➷ Arrival delays due to 'Security' may have had greater medians and distributions with larger values for some carriers, such as 'United Air Lines Inc.', 'Delta Air Lines Inc.', and 'Aloha Airlines Inc.'.
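Statements such as "the highest impact for 11 of the 20 carriers" come from finding the dominant cancellation cause per carrier and then counting carriers per cause. This is a minimal sketch of that tally, assuming 'Carrier' and 'CancellationCode' columns (A = carrier, B = weather, as in the first plot). The carrier codes and rows are toy values, not the notebook's data.

```python
import pandas as pd

# Toy cancelled-flights table (hypothetical rows).
toy = pd.DataFrame({
    'Carrier': ['AA', 'AA', 'AA', 'HA', 'HA', 'WN', 'WN', 'WN'],
    'CancellationCode': ['A', 'A', 'B', 'A', 'A', 'B', 'B', 'A'],
})

# Rows: carriers, columns: cancellation codes, cells: cancellation counts.
counts = pd.crosstab(toy['Carrier'], toy['CancellationCode'])

# Dominant cause per carrier, then the number of carriers each cause dominates.
top_cause = counts.idxmax(axis=1)
carriers_per_cause = top_cause.value_counts()
print(top_cause)
print(carriers_per_cause)
```

Note that `idxmax` returns the first column on ties, so carriers with two equally frequent causes would need a separate check.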